Trigger an instance startup, receive files from FTP, process, and upload them - amazon-web-services

I'm using an Amazon compute instance with Windows Server 2012 R2 to run some executable I own for data processing.
Right now, what I do is send my data via FTP (I set up an FTP server on the remote Windows machine) and manually start the data processing. When the processing is complete, I download the outputs back over FTP and manually stop the remote Amazon instance.
I want to automate this process. Namely, I want to find a way to automatically start the remote machine when I start sending my data, then automatically trigger the processing (this I can handle via scripting), and then send back the data and shut down the machine automatically (this I think I can also handle).
So, to sum up, I need to know how I can automatically start the machine when I send my data to it.
I am using an FTP server on that machine and an EBS drive, but there may be a better way. Also, does anyone have any more suggestions on this setup?
Thank you

There are many ways to automate this. Is your control machine (from where you will be controlling the EC2 instance) a Linux or Windows machine?
Ansible: The easiest and most straightforward option if you are familiar with Ansible. Barely 20 lines of code to achieve what you want, and it is free. You would use the ec2 module to start/stop your instances and one of many modules to transfer the files. However, there is a bit of a learning curve.
AWS CLI: A one-line command to start (or stop) your instance. Once the instance is up and running, you can automate the file transfer part.
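For illustration, here is a minimal Python sketch of the same start/process/stop cycle using boto3 (the AWS SDK for Python); the instance ID and region below are placeholders you would replace with your own values.
import boto3

INSTANCE_ID = "i-0123456789abcdef0"                  # placeholder instance ID
ec2 = boto3.client("ec2", region_name="us-east-1")   # placeholder region

# Start the instance and block until it reports "running"
ec2.start_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])

# ... upload the data over FTP, trigger the processing, fetch the results ...

# Stop the instance once the work is done
ec2.stop_instances(InstanceIds=[INSTANCE_ID])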

Related

Google Cloud Platform Jupyter notebook still running after turning off local PC

I'm new to GCP and I'm trying to keep my process running in a Jupyter Notebook after shutting down my local PC. Does anyone know how I can do that? Currently I open a terminal on my VM, run jupyter notebook, and then, after starting the process in Jupyter, I'd like to turn my local machine off.
For now I keep following the process on my cellphone and shut things down from there. Does anyone know how to shut this down automatically when the process stops?
Sorry to ask two questions at once, but I think one is related to the other. If it is not, I can edit and make another one.
This is a technical limitation of Jupyter Notebooks, unfortunately. The browser window contains the code which updates the notebook itself, so if you close the browser window then there is no process running to update the notebook.
However, there is one workaround which you may find useful.
There is a library called Fairing that you can use with GCP's new AI Platform Notebooks, which allows you to package up your notebook and run it remotely, and that library will save the results of that execution in a GCP Storage bucket. No active internet connection is required (once you kick off the notebook run).
You can learn how to use it by creating a new GCP AI Platform Notebook and looking at the tutorials folder inside it. You can also find additional tutorials for Fairing here.
Typically to keep your remote sessions up in the event of network connectivity loss (which also covers shutting down the local computer) you'd use a terminal multiplexer application. From Known issues:
Intermittent disconnects: At this time, we do not offer a specific SLA for connection lifetimes. Use terminal multiplexers like tmux or screen if you plan to keep the terminal window open for an extended period of time.
But these multiplexers are terminal/text-mode apps, so you'd have to launch the notebook with the --no-browser option and then connect your local browser to its port.
You can find a recipe based on tmux and a local browser connection to the notebook using an SSH tunnel at Using Jupyter notebooks securely on remote linux machines.
As for shutting down the session: you'd just have to instruct the multiplexer application to end the session (or terminate the multiplexer app itself), which you could do automatically via a wrapper script that first invokes your process and, immediately after the process ends, invokes the commands to shut down the session.
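As a rough sketch (not part of the original answer), such a wrapper could be a small Python script that executes the notebook headlessly with nbconvert and then tears the session down; the notebook path and tmux session name are placeholders.
import subprocess

# Execute the notebook headlessly and save results in place; path is a placeholder.
subprocess.run(
    ["jupyter", "nbconvert", "--to", "notebook", "--execute",
     "--inplace", "my_long_job.ipynb"],
    check=True,
)

# When the run finishes, end the (assumed) tmux session this wrapper lives in ...
subprocess.run(["tmux", "kill-session", "-t", "notebook"], check=False)

# ... or, alternatively, power the VM off entirely:
# subprocess.run(["sudo", "shutdown", "-h", "now"])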

Starting sas job from remote computer

I have a scheduling program running on Server A under Windows Server 2008 R2. Server B is my SAS server, also under Windows Server 2008 R2. How do I kick off a job on the SAS server from my scheduling server? I can either use sas.exe or a batch file to start my job. The owners of the SAS server tell me that I cannot add an application or a Windows service to the SAS server. Is this even possible?
Below is a copy of my answer to a slightly different question (source: http://www.runsubmit.com/questions/260/hide-sas-batch-jobs-winxp). I'm copy/pasting it here for posterity and also because it's more likely to help people searching:
You can use PsExec, which is part of the Microsoft/Sysinternals list of utility programs. This file will go on the scheduling server. Grab it from here:
http://technet.microsoft.com/en-us/sysinternals/bb897553.aspx
The tool is designed to allow you to execute jobs on remote machines. For example, if you want to launch a SAS program from the command line you could run:
psexec \\machinename sas.exe -sysin remotedrivename:\remotefolder\myprogram.sas
This would launch SAS.EXE on the remote machine and run the supplied program that exists on the remote machine. When it launches SAS, it appears to run it within the PsExec service (PSEXESVC). Because it's running within a service, no interface will be displayed. I'm not even sure if you would see it appear as its own process or application in Windows Task Manager. If you use Sysinternals' other program, Process Explorer, instead of Task Manager, you can see this happening.
Note that the REMOTE MACHINE and the LOCAL machine can be the same machine.
PROS: Many other uses for this technique. It's free. PsExec is only required on the machine that is making the call, not both machines.
CONS: It's a bit of a roundabout way to do things. You need to install a third-party program (although it is now an MS tool). Some antivirus programs/network admins may not allow it.
Note that if your SAS jobs access network resources, then you will probably need to make the network resource available first using the net use command. I suggest running your SAS job in a batch file like so (or use the 'x' command from within your SAS file to call the 'net use' commands):
Command executed from local machine:
psexec \\machinename remotedrivename:\remotefolder\myprogram.BAT
Contents of batch file on remote machine:
net use m: \\fileserver\sharedfolder /USER:mynetworkdomainname\myusername mypassword
sas.exe -sysin remotedrivename:\remotefolder\myprogram.sas
net use m: /delete

Remote launching C++ apps

My problem is simple: I have 1 computer connected to many powerful servers. I want to execute the app locally but run the (heavy-load) processing on the remote servers.
The app and settings vary a lot, and I want exactly this version of the app and its settings folder to be used by the remote instances.
My approach so far:
Launch the app locally
Use PsExec to remotely launch, on the servers, the same executable that is running locally (with a random port number passed as an argument)
Connect to them via sockets
Send commands to execute remotely and get the results
My problem lies in the config files, which are many (50+), some of them over 4 MB. These config files are TXT files in a config folder.
What is the proper way to do this? Is it possible to use PsExec to remotely copy a whole folder? Is there any good trick over the sockets to pass a copy of the local files directly to the remote machines?
I would like the whole process to be semi-transparent, since many people will use it with different versions and settings at the same time, so manually copying the files to 20+ servers is NOT an option.
Thank you!
Put the program/script that you want all machines to execute in one common location on the local network (put your configs there too). On each server, create a batch file, say 'runme.bat', that will execute your program directly from the network location.
This way you can use psexec to run runme.bat, essentially executing your program/script on any server you want (see the Python sketch below for one way to drive this).
Since there are often issues using psexec, you may instead invoke your scripts from Task Scheduler, etc.
I do that for 500+ servers and it works. If it works for me, it will work for you.
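As a rough illustration (my own sketch, not the original poster's setup), driving this from a Python script with PsExec could look like the following; the server names and the path to runme.bat are placeholders.
import subprocess

servers = ["server01", "server02", "server03"]   # placeholder server names

for server in servers:
    # psexec must be available on the calling machine; the path to
    # runme.bat on each remote server is a placeholder.
    result = subprocess.run(
        ["psexec", rf"\\{server}", r"C:\jobs\runme.bat"],
        capture_output=True, text=True,
    )
    print(server, "exit code:", result.returncode)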
You might want to look at HTCondor (http://research.cs.wisc.edu/htcondor/) which could perhaps manage all of this for you.

Best Practice: AWS ftp with file processing

I'm looking for some direction on an AWS architectural decision. My goal is to allow users to FTP a file to an EC2 instance and then run some analysis on the file. My focus is to build this in as service-oriented a way as possible, and in the future scale it out for multiple clients, where each would have their own FTP server and processing queue with no co-mingling of data.
Currently I have a dev EC2 instance with vsftpd installed and a node.js process running Chokidar that continuously watches for new files to be dropped. When a file drops, I'd like another server or group of servers to be notified to get the file and process it.
Should the FTP server move the file to S3 and then use SQS to let the pool of processing servers know that it's ready for processing? Should I use SQS and then have the pool of servers SSH into the FTP instance (or another approach) to get the file rather than use S3 as an intermediary? Are there better approaches?
Any guidance is very much appreciated. Feel free to school me on any alternate ideas that might save money at high file volume.
I'd segregate it right down into small components.
Load balancer
FTP Servers in scaling group
Daemon on FTP servers to move files to S3 and then queue a job (a minimal sketch of this step is shown below)
Processing servers in scaling group
This way you can scale the ftp servers if necessary, or scale the processing servers (on SQS queue length or processor utilisation). You may end up with one ftp server and 5 processing servers, or vice versa - but at least this way you only scale at the bottleneck.
The other thing you may want to look at is AWS Data Pipeline, which (whilst not knowing the details of your job) sounds like it's tailor-made for your use case.
S3 and queues are cheap, and it gives you more granular control around the different components to scale as appropriate. There are potentially some smarts around wildcard policies and IAM you could use to tighten the data segregation too.
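For the "move to S3 and queue a job" step listed above, a minimal Python sketch with boto3 might look like this; the bucket name and queue URL are placeholders.
import json
import boto3

BUCKET = "my-ftp-dropbox"                                                        # placeholder bucket
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/processing-jobs"   # placeholder queue URL

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

def handle_new_file(local_path, key):
    # Copy the file that just landed on the FTP server into S3 ...
    s3.upload_file(local_path, BUCKET, key)
    # ... then tell the processing fleet where to find it.
    sqs.send_message(QueueUrl=QUEUE_URL,
                     MessageBody=json.dumps({"bucket": BUCKET, "key": key}))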
Ideally I would try to process the file on the server where it is currently placed.
This will save a lot of network traffic and CPU load.
However, if you want one of the servers to act as a reverse proxy and load balance across a farm of servers, then I would notify that server with an HTTP call that the file has arrived. I would make the file available via FTP (since you already have a working vsftpd, that will not be a problem) and include the file's FTP URL in the HTTP call, so the server that will do the processing can get the file and start working on it immediately.
This way you will save money by not using S3, SQS, or any other additional services.
If the farm is made up of servers of equal capacity, then the algorithm for distributing the load can be simple round-robin; if the servers have different capacities, then the load should be distributed according to server performance (a small sketch of this follows at the end of this answer).
For example, if server ONE has 3 times the performance of server THREE and server TWO has 2 times the performance of server THREE, then you can do:
1: Server ONE - forward 3 requests
2: Server TWO - forward 2 requests
3: Server THREE - forward 1 request
4: GOTO 1
Ideally there should be feedback from the servers reporting their current load, so the load balancer knows who is the best candidate for the next request instead of using hard-coded ratios, since the requests probably do not need exactly equal amounts of resources to be processed; but this starts to look like the MapReduce paradigm and is out of scope... at least for the beginning. :)
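Here is a small Python sketch of the weighted round-robin rotation described above; the server names and weights are placeholders standing in for relative server performance.
import itertools

# Weights reflect relative server performance (placeholder values).
weights = {"server_one": 3, "server_two": 2, "server_three": 1}

# Repeat each server name by its weight and cycle over the result forever.
rotation = itertools.cycle(
    [name for name, weight in weights.items() for _ in range(weight)]
)

def next_server():
    # Each call returns the server that should receive the next request.
    return next(rotation)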
If you want to stick with S3, you could use RioFS to mount an S3 bucket as a local filesystem on your FTP and processing servers. Then you could do the usual file operations (e.g., get a notification when a file is created or modified).
Like RioFS, s3fs-fuse utilizes FUSE to provide a filesystem that is locally mountable; s3fs-fuse is currently well maintained.
In contrast, swineherd-fs (a filesystem abstraction for S3, HDFS and the normal filesystem) takes a different, locally virtual approach:
All filesystem abstractions implement the following core methods, based on standard UNIX functions, and Ruby's File class [...].
As the 'local abstraction layer' has only been thoroughly tested on Ubuntu Linux, I'd personally go for a more mainstream/solid/less experimental stack, i.e.:
a (sandboxed) vsftpd for FTP transfers,
(optionally) something to listen for filesystem changes, and finally
a trigger for middleman-s3_sync to do the cloud lift (or let it synchronize everything by itself).
Alternatively, and more experimental, there are some GitHub projects that might fit:
s3-ftp: FTP server front-end that forwards all uploads to an S3 bucket (Clojure)
ftp-to-s3: An FTP server that uploads every file it receives to S3 (Python)
ftp-s3: FTP frontend to S3 in Python.
Last but not least, I do recommend the donationware Cyberduck if you're on OS X: a comfortable (and very FTP-like) client that interfaces with S3 directly. For Windows there is a freeware (with optional PRO version) named S3 Browser.

How to transfer and execute a .exe (py2exe) file on remote Windows machines in a network using Python

I have a set of Python files which take information from the Windows registry, and I have converted these files into a .exe using py2exe. Now I need to scan the network and detect the active endpoints in it; after detecting the endpoints/machines in the network, I need to send this .exe to those machines and have it execute. How do I do this in Python? I have code for the transfer of the files; my problem is how to execute the .exe file on a remote Windows machine (I have the credentials of the machine) and get the results back to the server.
I need to accomplish this task using Python.
Any help or suggestions please.
Thanks in advance