I have a media-based web application running on AWS (EC2 Windows), and I'm trying to achieve scalability by putting the app and web servers in an Auto Scaling group.
My problem is that I need to move the media storage to S3 so that I can share it between different app server clusters. But I also have to move these media files from S3 to different FTP servers. For that I currently have to download the files from S3 to the app server and then do the FTP upload, which makes the process take too much time. Note that I am using ColdFusion as the application server.
Now I have two options to solve this:
Mount the S3 bucket on the EC2 instances (I know that is not recommended, and I'm also not sure whether it would improve the speed of the FTP upload).
Use Lambda to upload files directly from S3 to the FTP servers.
I cannot use a separate EBS volume for each EC2 instance because:
The storage volume is huge and it would result in high cost.
I would need to sync the media storage across the EBS volumes attached to the different EC2 instances.
EFS is not an option since I'm using Windows servers.
Can anyone suggest a better solution?
That is pretty easy with Python:
from ftplib import FTP
from socket import _GLOBAL_DEFAULT_TIMEOUT
import urllib.request

class FtpCopier(FTP):

    source_address = None
    timeout = _GLOBAL_DEFAULT_TIMEOUT

    # host → FTP host name / IP
    # user → FTP login user
    # password → FTP password
    # port → FTP port
    # encoding → FTP server encoding
    def __init__(self, host, user, password, port=21, encoding='utf-8'):
        self.host = host
        self.user = user
        self.password = password
        self.port = port
        self.connect(self.host, self.port)
        self.login(self.user, self.password, '')
        self.encoding = encoding

    # url → any web URL (for example S3)
    # to_path → full path on the FTP server (check that the destination folders exist)
    # chunk_size_mb → data read chunk size
    def transfer(self, url, to_path, chunk_size_mb=10):
        chunk_size_mb = chunk_size_mb * 1048576  # 1024*1024
        file_handle = urllib.request.urlopen(url)
        self.storbinary("STOR %s" % to_path, file_handle, chunk_size_mb)
Usage example:
ftp = FtpCopier("some_host.com", "user", "p#ssw0rd")
ftp.transfer("https://bucket.s3.ap-northeast-2.amazonaws.com/path/file.jpg", "/path/new_file.jpg")
But remember that Lambda execution time is limited to 15 minutes, so a timeout may occur before the file transfer completes. I recommend using ECS Fargate instead of Lambda; it allows the process to run for as long as you want.
If the S3 file is not public, use a presigned URL to access it via urllib:
aws s3 presign s3://bucket/path/file.jpg --expires-in 604800
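If you prefer to generate the presigned URL from Python rather than the CLI, a minimal sketch with boto3 (the bucket and key below are placeholders) could look like this:

import boto3

s3 = boto3.client('s3')
# Placeholder bucket/key; replace with your own object.
url = s3.generate_presigned_url(
    'get_object',
    Params={'Bucket': 'bucket', 'Key': 'path/file.jpg'},
    ExpiresIn=604800  # 7 days, same as the CLI example above
)
ftp.transfer(url, '/path/new_file.jpg')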
file_path = "\\wfs8-XXXXX\XXXX"
The files are on a remote server path. I am using Cloud Composer to automate my data pipeline,
so how will I be able to copy the files from the remote Windows server to a GCS bucket using Composer?
I tried to use LocalFilesystemToGCSOperator, but I am not able to provide any connection option to connect to the remote Windows server. Please advise.
upload_file = LocalFilesystemToGCSOperator(
    task_id="upload_file",
    src=PATH_TO_UPLOAD_FILE,
    dst=DESTINATION_FILE_LOCATION,
    bucket=BUCKET_NAME,
)
In this case you can use the SFTPToGCSOperator, for example:
copy_file_from_sftp_to_gcs = SFTPToGCSOperator(
    task_id="file-copy-sftp-to-gcs",
    sftp_conn_id="<your-connection>",
    source_path=f"{FILE_LOCAL_PATH}/{OBJECT_SRC_1}",
    destination_bucket=BUCKET_NAME,
)
You have to configure the SFTP connection in Airflow; you can check this topic for an example.
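As a rough sketch, the SFTP connection can also be created programmatically with Airflow's Connection model (the connection id, host, and credentials below are hypothetical; the Windows server must expose an SSH/SFTP service):

from airflow import settings
from airflow.models import Connection

# Hypothetical connection pointing at the remote Windows server's SFTP service.
conn = Connection(
    conn_id="my_sftp_conn",
    conn_type="sftp",
    host="windows-host.example.com",
    login="user",
    password="p@ssw0rd",
    port=22,
)
session = settings.Session()
session.add(conn)
session.commit()

The sftp_conn_id in the operator above would then be set to "my_sftp_conn".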
I have a website hosted on an EC2 server. I want to monitor the website endpoint and restart the EC2 instance if the website is unavailable for a certain time frame (say 60 seconds).
What tools do I use in AWS and how do I accomplish this?
This is not a recommended approach.
Firstly, if a website is unavailable, you would probably want to investigate the cause rather than just restarting the instance. Your goal should be to run a stable system by removing root causes of problems rather than just ignoring the problem by restarting all the time.
The recommended design would be to run in a Highly Available configuration with:
The application running on at least two servers across at least two Availability Zones (in case of failure of an AZ). This is not necessarily more expensive because each server can be smaller than a single, large server.
A load balancer in front of the instances, distributing the traffic to them. The load balancer also performs continuous health checks and stops sending requests to servers that fail the health check.
An Auto Scaling group that can terminate unhealthy instances and automatically launch replacement servers. This also works well if an Availability Zone should fail.
In this design, an unhealthy instance would be terminated (stopped and destroyed) and a new instance created with a pre-defined disk image and startup script. Alternatively, you might choose to move bad instances out of the Auto Scaling group for investigation of the problem, with a new instance being launched to take its place.
If your application requires a database, the database should be external to the instances so that all instances can connect to the database and replacing application instances does not cause any data loss.
As to the speed of noticing problems on a server, the load balancer can perform checks every few seconds. Amazon CloudWatch, on the other hand, would need at least a minute to detect problems (probably longer, since metrics are calculated over a period rather than being "now" metrics).
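As an illustration, here is a sketch of tightening the health check interval via the ELBv2 API (the target group ARN and health check path below are placeholders):

import boto3

elbv2 = boto3.client('elbv2')
# Placeholder ARN and path; check every 10 seconds and mark a target
# unhealthy after 2 consecutive failures.
elbv2.modify_target_group(
    TargetGroupArn='arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abc123',
    HealthCheckIntervalSeconds=10,
    HealthyThresholdCount=2,
    UnhealthyThresholdCount=2,
    HealthCheckPath='/health',
)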
John's approach is the correct one, but at its simplest:
Write a Lambda function that queries your website to see whether it is running and, if not, have that Lambda function restart the instance.
Set up a CloudWatch Events rule that runs on a schedule you determine and calls the Lambda function.
I'll leave to you the work of writing the code that determines whether the website is functional and restarts the server, but that is pretty straightforward; a minimal sketch is shown below. You can use Python, Java, Node, Go, or .NET Core in your Lambda function. I would think Python would be the easiest in this case, but that is an opinion.
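For reference, a minimal sketch of such a handler (the SITE_URL and INSTANCE_ID environment variables are hypothetical, and error handling is kept to a bare minimum):

import os
import urllib.request

import boto3

SITE_URL = os.environ['SITE_URL']        # hypothetical: endpoint to probe
INSTANCE_ID = os.environ['INSTANCE_ID']  # hypothetical: instance to restart

def lambda_handler(event, context):
    try:
        # Treat the site as healthy if it answers HTTP 200 within 10 seconds.
        with urllib.request.urlopen(SITE_URL, timeout=10) as response:
            if response.status == 200:
                return 'healthy'
    except Exception:
        pass
    # Unreachable or unhealthy: reboot the instance.
    ec2 = boto3.client('ec2')
    ec2.reboot_instances(InstanceIds=[INSTANCE_ID])
    return 'rebooted'

Point the CloudWatch Events rule at this function on whatever schedule suits you.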
It is clear that this is not a best practice in AWS, but it can make sense - e.g. when you are running a small personal web server with low demand, where availability is less of an issue than cost.
At least that was my reason for building automation for it.
[architecture diagram]
Lambda code:
import json
import os
import boto3
import time

env_vars = [
    'ALARM_NAME',
    'REGION',
    'INSTANCE_ID',
    'OUTPUT_SNS_ARN'
]

ENV = {}
for env_var in env_vars:
    ENV[env_var] = os.environ.get(env_var, None)
    if not ENV[env_var]:
        raise Exception(f"Environment variable {env_var} must be set!")


def reboot_instance(instanceID, regionName) -> "instanceID":
    """
    Stop and start the given instance.
    instanceID - ID of the instance
    regionName - name of the region
    Returns the instanceID; raises an exception if the instance cannot be stopped.
    """
    ec2 = boto3.resource('ec2', region_name=regionName)
    instance = ec2.Instance(instanceID)
    try:
        instance.stop()
        time.sleep(30)
        instance.stop(Force=True)
    except Exception:
        pass

    for i in range(180):  # wait up to 3 minutes for the instance to stop
        instance = ec2.Instance(instanceID)
        if instance.state['Code'] == 80:  # 80 == stopped
            break
        time.sleep(1)
    else:
        raise Exception('Unable to stop instance')

    instance.start()
    return instanceID


def notify_about_reboot(instanceID, snsarn) -> True:
    """
    Publish an SNS message about the reboot to snsarn.
    """
    client = boto3.client('sns', region_name='us-east-1')
    client.publish(TopicArn=snsarn, Message=f'EC2 instance {instanceID} was rebooted!')
    return True


def lambda_handler(event, context) -> "status about reboot":
    """
    event: see events/event.json
    """
    print('EVENT:')
    print(event)
    for record in event.get('Records', None):
        sns = record.get('Sns', None)
        message = json.loads(sns.get('Message', None))
        msgalarm = message.get('AlarmName', None)
        msgstatus = message.get('NewStateValue', None)
        if not all([sns, message, msgalarm, msgstatus]):
            continue
        if (msgalarm == ENV['ALARM_NAME']) and (msgstatus == 'ALARM'):
            notify_about_reboot(reboot_instance(ENV['INSTANCE_ID'], ENV['REGION']), ENV['OUTPUT_SNS_ARN'])
            return 'rebooting'
        else:
            return 'nothing to do'
    return 'no sns record found'
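For a quick local test, the handler expects a standard SNS-wrapped CloudWatch alarm notification; a rough sketch of such an event (the alarm name is hypothetical and must match the ALARM_NAME environment variable):

test_event = {
    "Records": [
        {
            "Sns": {
                # The inner Message is a JSON string, exactly as SNS delivers it.
                "Message": '{"AlarmName": "my-health-check-alarm", "NewStateValue": "ALARM"}'
            }
        }
    ]
}

print(lambda_handler(test_event, None))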
I have released the whole tested automation, with a SAM template and installation instructions, at https://github.com/koss822/misc/tree/master/Aws/route53-healthcheck-instance-reboot
I enabled "Transfer acceleration" on my bucket. But I dont see any improvement in speed of Upload in my C++ application. I have waited for more than 20 minutes that is mentioned in AWS Documentation.
Does the SDK support "Transfer acceleration" by default or is there a run time flag or compiler flag? I did not spot anything in the SDK code.
thanks
Currently, there isn't a configuration option that simply turns on transfer acceleration. You can, however, use the endpoint override in the client configuration to set the accelerated endpoint.
What I did to enable a (working) transfer acceleration:
Set "Transfer Acceleration" to enabled in the bucket configuration in the AWS console.
Add the s3:PutAccelerateConfiguration permission to the IAM user that I use inside my C++ application.
Add the following code to the S3 transfer configuration (bucket_ is your bucket name; the final URL must match the one shown in the AWS console under "Transfer Acceleration"):
Aws::Client::ClientConfiguration config;
/* other configuration options */
config.endpointOverride = bucket_ + ".s3-accelerate.amazonaws.com";
Request acceleration on the bucket before the transfer (see the docs):
auto s3Client = Aws::MakeShared<Aws::S3::S3Client>("Uploader",
    Aws::Auth::AWSCredentials(id_, key_), config);

Aws::S3::Model::PutBucketAccelerateConfigurationRequest bucket_accel;
bucket_accel.SetAccelerateConfiguration(
    Aws::S3::Model::AccelerateConfiguration().WithStatus(
        Aws::S3::Model::BucketAccelerateStatus::Enabled));
bucket_accel.SetBucket(bucket_);
s3Client->PutBucketAccelerateConfiguration(bucket_accel);
You can check in the detailed logs of the AWS SDK that your code is using the accelerated endpoint, and you can also check that before the transfer starts there is a call to /?accelerate.
What worked for me:
Enabling S3 Transfer Acceleration within AWS console
When configuring the client, use only the accelerated endpoint:
clientConfig->endpointOverride = "s3-accelerate.amazonaws.com";
@gabry - your solution was extremely close. I think the reason it wasn't working for me was perhaps due to SDK changes since it was originally posted, as the change is relatively small. Or maybe it's because I am constructing put-object templates for requests used with the transfer manager.
Looking through the logs (Debug level), the SDK automatically concatenates the bucket used in transferManager::UploadFile() with the overridden endpoint. I was getting unresolved-host errors because the requested host looked like:
[DEBUG] host: myBucket.myBucket.s3-accelerate.amazonaws.com
This way I could still keep the same S3_BUCKET macro name while only selectively applying the endpoint override when instantiating a new configuration for upload.
For example:
<<
...
auto putTemplate = new Aws::S3::Model::PutObjectRequest();
putTemplate->SetStorageClass(STORAGE_CLASS);
transferConfig->putObjectTemplate = *putTemplate;

auto multiTemplate = new Aws::S3::Model::CreateMultipartUploadRequest();
multiTemplate->SetStorageClass(STORAGE_CLASS);
transferConfig->createMultipartUploadTemplate = *multiTemplate;

transferMgr = Aws::Transfer::TransferManager::Create(*transferConfig);
auto transferHandle = transferMgr->UploadFile(localFile, S3_BUCKET, s3File);
...
>>
I would like to upload files to S3 using boto3.
The code will run on a server without DNS configured, and I want the upload process to be routed through a specific network interface.
Any idea whether there's a way to solve these issues?
1) Add the endpoint addresses for S3 to /etc/hosts; see this list: http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region
2) Configure a specific route to the network interface; see this info on Super User:
https://superuser.com/questions/181882/force-an-application-to-use-a-specific-network-interface
As for setting a network interface, I did a workaround that allows setting the source IP for each connection made by boto.
Just change the AWSHTTPConnection class in awsrequest.py as follows:
a) Before __init__() of AWSHTTPConnection, add:
source_address = None
b) Inside __init__(), add:
if AWSHTTPConnection.source_address is not None:
    kwargs["source_address"] = AWSHTTPConnection.source_address
Now, from your code you should do the following before you start using boto:
from botocore.awsrequest import AWSHTTPConnection
AWSHTTPConnection.source_address = (source_ip_str, source_port)
Use source_port = 0 in order to let the OS choose a random port (you probably want this option; see the Python socket docs for more details).
I'm currently in need of connecting to a fake_sqs server for dev purposes, but I can't find an easy way to specify the endpoint for the boto.sqs connection. In Java and Node.js there are ways to specify the queue endpoint, and by passing something like 'localhost:someport' I can connect to my own SQS-like instance. I've tried the following with boto:
fake_region = regioninfo.SQSRegionInfo(name=name, endpoint=endpoint)
conn = fake_region.connect(aws_access_key_id="TEST", aws_secret_access_key="TEST", port=9324, is_secure=False);
and then:
queue = connAmazon.get_queue('some_queue')
but it fails to retrieve the queue object; it returns None. Has anyone managed to connect to their own SQS instance?
Here's how to create an SQS connection that connects to fake_sqs:
import boto.sqs.connection
import boto.sqs.regioninfo

region = boto.sqs.regioninfo.SQSRegionInfo(
    connection=None,
    name='fake_sqs',
    endpoint='localhost',  # or wherever fake_sqs is running
    connection_cls=boto.sqs.connection.SQSConnection,
)
conn = boto.sqs.connection.SQSConnection(
    aws_access_key_id='fake_key',
    aws_secret_access_key='fake_secret',
    is_secure=False,
    port=4568,  # or wherever fake_sqs is running
    region=region,
)
region.connection = conn
# you can now work with conn
# conn.create_queue('test_queue')
Be aware that, at the time of this writing, the fake_sqs library does not respond correctly to GET requests, which is how boto makes many of its requests. You can install a fork that has patched this functionality here: https://github.com/adammck/fake_sqs