Best Practice: AWS FTP with file processing

I'm looking for some direction on an AWS architectural decision. My goal is to allow users to FTP a file to an EC2 instance and then run some analysis on the file. My focus is to build this in as service-oriented a way as possible, and in the future scale it out for multiple clients, where each would have their own FTP server and processing queue with no co-mingling of data.
Currently I have a dev EC2 instance with vsftpd installed and a node.js process running Chokidar that continuously watches for new files to be dropped. When a file drops, I'd like another server or group of servers to be notified so they can get the file and process it.
Should the FTP server move the file to S3 and then use SQS to let the pool of processing servers know that it's ready for processing? Should I use SQS and then have the pool of servers ssh into the FTP instance (or some other approach) to get the file, rather than use S3 as an intermediary? Are there better approaches?
Any guidance is very much appreciated. Feel free to school me on any alternate ideas that might save money at high file volume.

I'd segregate it right down into small components.
Load balancer
FTP Servers in scaling group
Daemon on FTP Servers to move to S3 and then queue a job
Processing servers in scaling group
This way you can scale the ftp servers if necessary, or scale the processing servers (on SQS queue length or processor utilisation). You may end up with one ftp server and 5 processing servers, or vice versa - but at least this way you only scale at the bottleneck.
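A minimal sketch of what the "move to S3 and then queue a job" daemon above might look like, assuming Python with boto3; the bucket name, queue URL, and watch directory are placeholders, and a real daemon would also need to confirm the FTP upload has finished before moving a file:

import os
import time
import boto3

BUCKET = "client-a-uploads"  # hypothetical per-client bucket
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/client-a-jobs"  # hypothetical
WATCH_DIR = "/srv/ftp/incoming"

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

while True:
    for name in os.listdir(WATCH_DIR):
        path = os.path.join(WATCH_DIR, name)
        if not os.path.isfile(path):
            continue
        # Upload to S3, then tell the processing pool about it via SQS.
        s3.upload_file(path, BUCKET, name)
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody="s3://%s/%s" % (BUCKET, name))
        os.remove(path)  # the S3 copy is now the source of truth
    time.sleep(5)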
The other thing you may want to look at is AWS Data Pipeline, which (though I don't know the details of your job) sounds like it's tailor-made for your use case.
S3 and queues are cheap, and it gives you more granular control around the different components to scale as appropriate. There are potentially some smarts around wildcard policies and IAM you could use to tighten the data segregation too.
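For example, the data segregation could be enforced with a per-client IAM policy along these lines (bucket name and prefix are illustrative):

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:PutObject", "s3:GetObject"],
    "Resource": "arn:aws:s3:::uploads-bucket/client-a/*"
  }]
}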

Ideally I would try to process the file on the server where it is currently placed.
This will save a lot of network traffic and CPU load.
However, if you want one of the servers to act like a reverse proxy and load-balance across a farm of servers, then I would notify that server with an HTTP call that the file has arrived. I would make the file available via FTP (since you already have a working vsftpd this will not be a problem) and include the file's FTP URL in the HTTP call, so that the server that will do the processing can get the file and start working on it immediately.
This way you will save money by not using S3, SQS, or any other additional services.
If the farm is made up of servers of equal capacity, then the algorithm for distributing the load should be round-robin; if the servers have different capacities, then the load should be distributed according to server performance.
For example, if server ONE has 3 times the performance of server THREE, and server TWO has 2 times the performance of server THREE, then you can do:
1: Server ONE - forward 3 requests
2: Server TWO - forward 2 requests
3: Server THREE - forward 1 request
4: GOTO 1
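A minimal sketch of this fixed weighted rotation in Python, using the server names and weights from the example:

import itertools

SERVERS = [("ONE", 3), ("TWO", 2), ("THREE", 1)]  # name, relative capacity

# Expand each server name by its weight, then cycle forever:
# ONE, ONE, ONE, TWO, TWO, THREE, ONE, ONE, ...
schedule = itertools.cycle([name for name, weight in SERVERS for _ in range(weight)])

def next_server():
    return next(schedule)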
Ideally there should be feedback from the servers, reporting their current load, so the load balancer knows who is the best candidate for the next request instead of using hard-coded algorithms, since the requests probably don't all need exactly the same amount of resources to process. But this starts to look like the Map-Reduce paradigm and is out of scope... at least for the beginning. :)

If you want to stick with S3, you could use RioFS to mount an S3 bucket as a local filesystem on your FTP and processing servers. Then you could do the usual file operations (e.g. get a notification when a file is created or modified).

Like RioFS, s3fs-fuse utilizes FUSE to provide a filesystem that is (virtually) locally mountable; s3fs-fuse is currently well-maintained.
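A typical s3fs-fuse invocation, assuming credentials in a passwd file; the bucket name and mount point are placeholders:

s3fs mybucket /mnt/s3 -o passwd_file=${HOME}/.passwd-s3fs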
In contrast, swineherd-fs, the "Filesystem Abstraction for S3, HDFS and normal filesystems", allows a different (locally virtual) approach: "All filesystem abstractions implement the following core methods, based on standard UNIX functions, and Ruby's File class [...]".
As the local abstraction layer has only been thoroughly tested on Ubuntu Linux, I'd personally go for a more mainstream/solid/less experimental stack, i.e.: a (sandboxed) vsftpd for FTP transfers, (optionally) listening for filesystem changes, and finally triggering middleman-s3_sync to do the cloud lift (or letting it synchronize everything by itself).
Alternatively, and more experimentally, there are some GitHub projects that might fit:
s3-ftp: an FTP server front-end that forwards all uploads to an S3 bucket (Clojure)
ftp-to-s3: an FTP server that uploads every file it receives to S3 (Python)
ftp-s3: an FTP front-end to S3 (Python)
Last but not least, I do recommend the donationware Cyberduck if you're on OS X: a comfortable (and very FTP-like) client interfacing with S3 directly. For Windows there is a freeware client (with an optional PRO version) named S3 Browser.

Related

Architecture Design for API of Cloud Service

Background:
I have a local application that processes the user input for about 3 seconds and then returns an answer (output) to the user.
(I purposely don't want to go into details about my application, so as not to complicate the question and to keep it a purely architectural one.)
My Goal:
I want to make my application a service in the cloud and expose an API
(for the upcoming website and for clients that will connect to the service without installing the software locally).
Possible Solutions:
Deploy WCF in the cloud and run my application there, so clients can invoke the service and use my application in the cloud (RPC style).
Use a Web API that inserts the request into a queue; a worker role then dequeues requests and posts the results to a DB. The client sends one request to create the job in the queue, and another request to get the result (which the Web API fetches from the DB).
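(As a rough, framework-agnostic sketch of what option #2 looks like, here in plain Python, with an in-memory queue and dict standing in for the Azure queue, worker role, and DB:)

import queue
import threading
import time
import uuid

jobs = {}             # job id -> result; None while still processing (stands in for the DB)
work = queue.Queue()  # stands in for the Azure queue

def worker_role():
    while True:
        job_id, payload = work.get()
        time.sleep(3)  # the ~3 seconds of processing
        jobs[job_id] = "answer for %r" % payload

threading.Thread(target=worker_role, daemon=True).start()

def create_request(payload):
    # First client call: enqueue the job, return an id.
    job_id = str(uuid.uuid4())
    jobs[job_id] = None
    work.put((job_id, payload))
    return job_id

def get_result(job_id):
    # Second (repeated) client call: None means not done yet.
    return jobs[job_id]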
The Problems:
If I go with the WCF solution (#1), I can't handle a great load of requests, maybe 10-20 simultaneously.
If I go with the WebAPI-Queue-WorkerRole solution (#2), the client will sometimes need to request the results multiple times, which can be a problem.
Also with the WebAPI-Queue-WorkerRole solution (#2), the process isn't synchronous: the client will not get the result the moment the processing of his request is done; he needs to ask for the result.
Questions:
In the WebAPI-Queue-WorkerRole solution (#2), can I somehow alert the client once his request has been processed and is done, so I can save the client the multiple requests for the result?
Isn't asking multiple times for the result an outdated approach? I remember that 10-15 years ago it was accepted, but now? I know that the VirusTotal API uses this kind of design.
Is there a better solution, one that will handle great loads and be sync or async (returning the result to the client once it's done)?
Thank you.
If you're using Azure, why not simply fire up more servers and use load balancing to handle more load? In that way, as your load increases, you have more servers to handle the requests.
Microsoft recently made available the Azure Service Fabric, which gives you a lot of control over spinning up and shutting down these services.

Jingle (XEP-0166): Does any Multimedia Data go via my server, and if not, who is billed for the data?

I am running an Openfire server on an AWS EC2 instance and am able to connect to the server from my mobile devices and send messages back and forth. Of course, since XMPP is a client-server protocol, I incur costs for running this traffic over the AWS server. For most use cases, however, this cost is not very high at all, as normal XMPP stanzas rarely seem to go above ca. 1 KB, so from this end all is OK.
I would now, however, like to include the ability to send images from one client to another. One way would be to use an HTTP server, to which user A uploads the picture; user A then sends the URL of the image to user B via XMPP, so that user B can get the image via HTTP. There are also several other methods for sending images via XMPP. However, I am interested in doing this via Jingle.
As far as I understand, Jingle is an out-of-band, peer-to-peer extension of XMPP. My simple question is: since Jingle communicates peer-to-peer for the multimedia aspect of the session, i.e. without the use of a server, will I even incur any data cost on AWS for transferring multimedia from one client to another using Jingle? Or, put differently, if Jingle is peer-to-peer, does any data go via my AWS server when using Jingle (apart from the session-initiate, ack, and session-terminate stanzas)? If not, what kind of route does this data take, and how can anyone be billed for this traffic, if it is peer-to-peer?
Jingle is a negotiation mechanism, and there are a couple of different transports it could negotiate for file transfer. The most common transport is the peer-to-peer bytestream defined in http://xmpp.org/extensions/xep-0260.html; here the only traffic you'd see via the server would be the Jingle negotiation, which is a similar sort of volume to other XMPP traffic. There is also an in-band bytestream (IBB) transport, defined in http://xmpp.org/extensions/xep-0261.html, that some clients will use, typically for smaller transfers since it's inefficient; it has the advantage of working in hostile networks with NAT and firewalls, though. If you control the clients, simply not supporting IBB would be your best bet for ensuring the traffic doesn't travel via the server. If you don't, I'd suggest configuring your server to block IBB traffic.
I note as well that running a server-side proxy will drastically increase the odds of the out-of-band mechanism in XEP-0260 working in the face of hostile networks, at the cost of server bandwidth.
There is also the not-widely-deployed out-of-band transport defined in http://xmpp.org/extensions/xep-0343.html.

Handling Uploads on a Django Site Behind a Load Balancer

I have a small- to medium-sized Django project where the client has been forced to change hosts. The new host convinced them they definitely needed a couple of web servers behind a load balancer (and to break the database off to a third server). I have everything ported over to the new setup, but I can't make it live yet, as I'm not sure of the best way to handle file uploads on the site: they will only get pushed up to the server the user is currently connected to. Given the three servers (counting the DB server, which could double as a static file server if I had to), what's the cleanest and easiest way to handle this situation?
A simple solution, which has some latency and does not scale beyond a few servers, is to use rsync between the hosts. Simply add it to cron to sync the upload directory both ways. Sticky sessions would also help here: the uploader would see their file as available immediately, and other visitors would be able to get the file after the next rsync completes.
This way you also get a free backup.
/usr/bin/rsync -url --size-only -e "ssh -i servers_ssh.key" user@server2:/dir /dir
(you'll have to have this in cron on both servers)
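For example, a crontab entry like the following on each server, syncing every minute (the key path and directories are illustrative):

* * * * * /usr/bin/rsync -url --size-only -e "ssh -i /home/user/servers_ssh.key" user@server2:/var/www/uploads/ /var/www/uploads/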

Move to 2 Django physical servers (front and backend) from a single production server?

I currently have a growing Django production server that has all of the front end and backend services running on it. I could keep growing that server larger and larger, but instead I want to try and leave that main server as my backend server and create multiple front end servers that would run apache/nginx and remotely connect to the main production backend server.
I'm using slicehost now, so I don't think I can benefit from having the multiple servers run on an intranet. How do I do this?
The first step in scaling your server is usually to separate the database server. I'm assuming this is all you mean by "backend services", since you haven't given any more details.
All this needs is a change to your settings file: change DATABASE_HOST from localhost to the IP of your new database server.
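(That's the literal DATABASE_HOST setting in older Django versions; in more recent ones the same change lives in the DATABASES dict. Host, name, and credentials below are placeholders:)

# settings.py
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql_psycopg2",
        "NAME": "myapp",
        "USER": "myapp",
        "PASSWORD": "secret",
        "HOST": "10.0.0.12",  # was "localhost" before splitting off the DB server
        "PORT": "5432",
    }
}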
If your site is heavy on static content, creating a separate media server could help. You may even look into a CDN.
The first step usually is to separate the server running the actual Python code from the database server. Any background jobs that do processing would probably run on the database server. I assume that when you say front-end server, you actually mean a server running Python code.
Now, as every request will have to do a number of database queries, latency between the web server and the database server is very important. I don't know if Slicehost has a feature that allows you to create two virtual machines that are "close" in terms of network latency (a quick Google search did not find anything). They seem like nice guys, so maybe you could ask them if they have such a service or could make an exception.
Anyway, when you do have two machines on Slicehost, you could check the latency between them by simply pinging between them. When you have the result you will probably know if this is at all feasible or not.
Further steps depends on your application. If it is media heavy, then maybe using a separate media server would make sense. Otherwise the normal step is to add more web servers.
--
As a side note, I personally think it makes more sense to invest in real dedicated servers with dedicated network equipment for this kind of setup. This of course depends on what budget you are on.
I would also suggest looking into Amazon EC2 where you can provision servers that are magically close to each other.

Ideal way/architecture to deliver large data over Web Services

We are trying to design 6 web services, which will serve another client component. The client component requires data from the web service we are implementing.
Now, the problem is that there is not just one Web Service being implemented. There is one Web Service which the client component hits, and this initiates a series of five more Web Services which gather data from their respective data stores and finally provide the data back to the original Web Service, which then delivers it back to the client component.
So, if the requested data becomes huge, this will be a serious problem for our internal communication channel.
So, what do you suggest? What can be done to avoid overloading the communication channel between the internal Web Services while still delivering the data to the client component?
Update 1
Using 5 web services, where each one does not know about the others except the next one, is a business requirement. Actually, five companies' "small services" are being integrated.
We use Java and Axis2
We've had a similar problem. Apart from trying to avoid it (e.g. for internal communication, go directly to the DB instead of through a web service), you can mitigate it by at least not performing the 5 or so tasks in series. Spawn new threads to collect them all in parallel and process them at the end to reduce latency (except where they might contend for the same resource and bottleneck).
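A minimal sketch of that fan-out, in Python for brevity; the service list and fetch function are placeholders for the actual (here, Axis2) downstream calls:

from concurrent.futures import ThreadPoolExecutor

SERVICES = ["svc-a", "svc-b", "svc-c", "svc-d", "svc-e"]  # the five downstream services

def fetch(service):
    # Placeholder for one downstream web-service call.
    return "data from " + service

with ThreadPoolExecutor(max_workers=len(SERVICES)) as pool:
    # Total latency is now roughly the slowest call, not the sum of all five.
    results = list(pool.map(fetch, SERVICES))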
But before I'd do anything, I'd load test it to see if it is even an issue, and get some baseline stats so you can see what improvement each change makes. Also, sometimes you might be better off tweaking network settings, or the actual network, rather than trying to optimise the code; but again, test and see.
Put all the data in a temporary compressed file and give back the FTP URL of the file.
The client fetches the big data chunk, uncompresses it, and reads it (maybe with some authentication mechanism for the FTP server).
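(A sketch of that idea in Python; the payload and file name are illustrative:)

import gzip

payload = b"...the big response..."  # whatever the downstream services produced

# Server side: write the compressed file where the FTP server can serve it.
with gzip.open("/srv/ftp/out/response-1234.gz", "wb") as f:
    f.write(payload)

# Client side, after downloading via the returned FTP URL:
with gzip.open("response-1234.gz", "rb") as f:
    data = f.read()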