I have a relatively simple task but am struggling to find the best mix of AWS services for it:
I have a simple Java program (provided by a third party; I can't modify it, only use it) that I can run anywhere with java -jar --target-location "path on local disk". Once executed, the program creates a CSV file on the local disk in the path defined by --target-location.
Once the file is created, I need to upload it to S3.
Currently I do this on a dedicated EC2 instance with Java installed: the first point is covered by java -jar ... and the second by the aws s3 cp ... command.
I'm looking for a better way of doing that (preferably serverless). I'm wondering whether the points above can be accomplished with an AWS Glue job of type Python Shell. The second point (copying a local file to S3) I can most likely cover with boto3, but I'm not sure about the first (executing java -jar).
Am I forced to use an EC2 instance, or do you see a smarter way with AWS Glue?
Or would it be most effective to build a Docker image (containing these two instructions), register it in ECR, and run it with AWS Batch?
I'm looking for a better way of doing that (preferably serverless).
I cannot tell whether a serverless option is better, but an EC2 instance will do the job just fine. Assuming you have CentOS on your instance, you can do it in either of the following ways.
aaPanel GUI
Some web panels offer cron-style scheduled tasks, such as backing up files from a local directory to an S3 bucket. I will use aaPanel as an example.
Install aaPanel
Install AWS S3 plugin
Configure the credentials in the plugin.
Cron
Add a scheduled task to back up files from "path on local disc" to AWS S3.
Rclone
A web panel goes beyond the scope of this question. Rclone is another useful tool I use to back up files from local disk to OneDrive, S3, etc.
Installation
curl https://rclone.org/install.sh | sudo bash
Sync
Sync a directory to the remote bucket, deleting any excess files in the bucket.
rclone sync -i /home/local/directory remote:bucket
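The two steps from the question (run the jar, then upload the result) could also be chained in one small script on the same instance. This is only a sketch under assumptions: the jar path, target directory, and bucket URI are hypothetical placeholders, and the upload uses the aws s3 cp command from the question restricted to CSV files.

```python
import subprocess
from pathlib import Path


def build_commands(jar_path, target_dir, s3_uri):
    """Return the two argv lists: run the jar, then copy its CSV output to S3."""
    # jar_path is a hypothetical placeholder; the question does not name the jar.
    java_cmd = ["java", "-jar", jar_path, "--target-location", target_dir]
    # Copy only *.csv files from the target directory (exclude all, re-include CSV).
    upload_cmd = [
        "aws", "s3", "cp", target_dir, s3_uri,
        "--recursive", "--exclude", "*", "--include", "*.csv",
    ]
    return java_cmd, upload_cmd


def run_pipeline(jar_path, target_dir, s3_uri):
    """Run both steps in order; check=True stops before the upload if java fails."""
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(jar_path, target_dir, s3_uri):
        subprocess.run(cmd, check=True)
```

A call such as run_pipeline("/opt/app/export.jar", "/tmp/export", "s3://my-bucket/exports/") (all three values made up) could then be scheduled with cron.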
Use case:
I have one on-premise directory that I want to back up, let's say every midnight, and I want to be able to restore it if something goes wrong.
It doesn't seem a complicated task, but reading through the AWS documentation, even this can be cumbersome and costly. Setting up Storage Gateway locally seems unnecessarily complex for a simple task like this, and setting it up on EC2 is costly as well.
What I have done:
Reading through this + some other blog posts:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html
https://docs.aws.amazon.com/storagegateway/latest/userguide/WhatIsStorageGateway.html
What I have found:
1. Setting up a file gateway (locally or as an EC2 instance):
It just mounts the files to S3, and that's it. So my on-premise app will constantly write to this S3 bucket. The documentation doesn't mention anything about scheduled backup and recovery.
2. Setting up a volume gateway:
Here I can make a scheduled synchronization/backup to S3, but using a whole volume for it would be a big overhead.
3. Standalone S3:
Just using bare S3 and copying my backup there via the AWS API/SDK with a manually created scheduled job.
Solutions:
Using point 1 from above, enable versioning and the versions of the files will serve as a recovery point.
Using point 3
I think I am looking for a mix of the file and volume gateways: working at the file level and making asynchronous scheduled snapshots of the files.
How should this be handled? Isn't there a really easy way to just send a backup of a directory to AWS?
The easiest way to backup a directory to Amazon S3 would be:
Install the AWS Command-Line Interface (CLI)
Provide credentials via the aws configure command
When required run the aws s3 sync command
For example
aws s3 sync folder1 s3://bucketname/folder1/
This will copy any files from the source to the destination. It will only copy files that have been added or changed since a previous sync.
Documentation: sync — AWS CLI Command Reference
If you want to be more fancy and keep multiple backups, you could copy to a different target directory, or create a zip file first and upload the zip file, or even use a backup program like Cloudberry Backup that knows how to use S3 and can do traditional-style backups.
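As a rough illustration of the "different target directory" idea, a nightly job could sync into a prefix named after the date. This is a sketch, not a recommended backup design: the bucket and prefix names are placeholders, and the one-prefix-per-day layout is an assumption.

```python
import datetime
import subprocess


def dated_destination(bucket, prefix, day):
    """Build an S3 URI such as s3://bucket/prefix/2024-01-31/ for one backup run."""
    return f"s3://{bucket}/{prefix.strip('/')}/{day.isoformat()}/"


def sync_command(local_dir, destination):
    """argv for the aws s3 sync command shown above."""
    return ["aws", "s3", "sync", local_dir, destination]


def backup(local_dir, bucket, prefix, day=None):
    """Sync local_dir into a per-day prefix and return the command that was run."""
    day = day or datetime.date.today()
    cmd = sync_command(local_dir, dated_destination(bucket, prefix, day))
    subprocess.run(cmd, check=True)
    return cmd
```

Because each day gets a fresh prefix, every run uploads the whole directory again; syncing to a single fixed prefix keeps the incremental behaviour described above.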
I am using the googleapiclient in python to launch VM instances. As part of that I am using the facility to run start up scripts to install docker and other python packages.
Now, one thing I would like to do is copy files to this instance ideally during the instance creation stage through python code.
What might be the way to achieve this? Ideally what would work is to be able to detect that the instance has booted and then be able to copy these files.
If I am hearing you correctly, you want files to be present inside the container that is being executed by Docker in your Compute Engine VM. Your Startup Script for the Compute Engine is installing docker.
My recommendation is not to try to copy those files into the container but instead to keep them on the local file system of the Compute Engine VM. Configure your docker startup to mount that directory from the VM into the docker container. Inside the container, you would then have access to the desired files.
As for bringing the files into the Compute Engine environment in the first place, we have a number of options. The core story however will be describing where the files start from in the first place.
One common approach is to keep the files that you want copied into the VM in a Google Cloud Storage (GCS) bucket/folder. From there, your startup script can use GCS API or the gsutil command to copy the files from the GCS bucket to the local file system.
Another thought, and again, this depends on the nature of the files ... is that you can create a GCP disk that simply "contains" the files. When you now create a new Compute Engine instance, that instance could be defined to mount the disk which is shared read-only across all the VM instances.
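The GCS-plus-mount approach could look roughly like this when the startup script is generated from Python. It is a sketch under assumptions: the bucket path, local directory, image name, and the /data mount point are hypothetical, and it presumes Docker was already installed by earlier lines of the startup script.

```python
import shlex


def build_startup_script(gcs_source, local_dir, image):
    """Return a bash startup script that pulls files from GCS and mounts them.

    All three arguments are placeholders chosen by the caller.
    """
    src = shlex.quote(gcs_source)
    dst = shlex.quote(local_dir)
    lines = [
        "#!/bin/bash",
        f"mkdir -p {dst}",
        # Copy the files from the bucket to the VM's local file system.
        f"gsutil -m cp -r {src} {dst}",
        # Mount the local directory read-only into the container at /data.
        f"docker run -d -v {dst}:/data:ro {shlex.quote(image)}",
    ]
    return "\n".join(lines) + "\n"
```

The returned string can be passed as the value of the startup-script metadata key when the instance is created.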
First of all, I would suggest using a tool like Terraform or Google Deployment Manager to create cloud infrastructure instead of writing custom Python code and handling all the edge cases yourself.
If for some reason you can't use such a tool and a Python program is your only option, you can do the following:
1. Create a GCS bucket using the Python API and apply an appropriate bucket policy to protect the data.
2. Create a service account that has read permission on that GCS bucket.
3. Launch the VM instance using the Python API, with your startup script installing packages and running the Docker container. Attach the service account above so the instance can read files from the GCS bucket.
4. Have a startup script in your Docker container that runs the `gsutil` command to fetch the files from the GCS bucket and put them in the right place.
Hope this helps.
Again, if you can use tools like Terraform, that will make things easy.
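If you do stay with Python, the request body for the instance launch in step 3 might be assembled like this. Treat it as a sketch: the machine type and source image are arbitrary assumptions, and the service account email and startup script are supplied by the caller.

```python
def build_instance_body(name, zone, service_account_email, startup_script):
    """Build a Compute Engine instance body with a read-only GCS service account."""
    return {
        "name": name,
        # Machine type and image below are assumptions; pick your own.
        "machineType": f"zones/{zone}/machineTypes/e2-small",
        "disks": [{
            "boot": True,
            "autoDelete": True,
            "initializeParams": {
                "sourceImage": "projects/debian-cloud/global/images/family/debian-12",
            },
        }],
        "networkInterfaces": [{
            "network": "global/networks/default",
            "accessConfigs": [{"type": "ONE_TO_ONE_NAT", "name": "External NAT"}],
        }],
        # The attached service account only needs to read from the bucket.
        "serviceAccounts": [{
            "email": service_account_email,
            "scopes": ["https://www.googleapis.com/auth/devstorage.read_only"],
        }],
        # The startup script runs on boot and can fetch the files with gsutil.
        "metadata": {"items": [{"key": "startup-script", "value": startup_script}]},
    }
```

With googleapiclient, a body like this would be passed to compute.instances().insert(project=..., zone=..., body=body).execute(), where compute comes from googleapiclient.discovery.build("compute", "v1").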
I have a Glue script (test.py) written in an editor. I connected to a Glue dev endpoint and copied the script to the endpoint (I could also store it in an S3 bucket). Since a Glue endpoint is basically an EMR cluster, how can I run the script from the dev endpoint terminal? Can I use spark-submit to run it?
I know we can run it from the Glue console, but I'm more interested in whether I can run it from the Glue endpoint terminal.
You don't need a notebook; you can ssh to the dev endpoint and run it with the gluepython interpreter (not plain python).
e.g.
radix@localhost:~$ DEV_ENDPOINT=glue@ec2-w-x-y-z.compute-1.amazonaws.com
radix@localhost:~$ scp myscript.py $DEV_ENDPOINT:/home/glue/myscript.py
radix@localhost:~$ ssh -i {private-key} $DEV_ENDPOINT
...
[glue@ip-w-x-y-z ~]$ gluepython myscript.py
You can also run the script directly, without getting an interactive shell, via ssh (after uploading the script with scp or similar, of course):
radix@localhost:~$ ssh -i {private-key} $DEV_ENDPOINT gluepython myscript.py
If this is a script that uses the Job class (as the auto-generated Python scripts do), you may need to pass --JOB_NAME and --TempDir parameters.
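If you want to drive the non-interactive ssh invocation from a Python script, a small helper could assemble the command, including the optional job parameters. This is a sketch; the key path, endpoint, job name, and temp directory are placeholders supplied by the caller.

```python
import subprocess


def gluepython_command(private_key, endpoint, script, job_name=None, temp_dir=None):
    """argv for running a script with gluepython on the dev endpoint over ssh."""
    cmd = ["ssh", "-i", private_key, endpoint, "gluepython", script]
    # Scripts built around the Job class may need these two parameters.
    if job_name:
        cmd += ["--JOB_NAME", job_name]
    if temp_dir:
        cmd += ["--TempDir", temp_dir]
    return cmd


def run_on_endpoint(private_key, endpoint, script, job_name=None, temp_dir=None):
    """Run the script remotely and raise if gluepython exits with an error."""
    cmd = gluepython_command(private_key, endpoint, script, job_name, temp_dir)
    subprocess.run(cmd, check=True)
```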
For development and testing, you can set up a Zeppelin notebook locally and establish an SSH connection using the AWS Glue endpoint URL, so that you have access to the Data Catalog, crawlers, etc., and also to the S3 bucket where your data resides.
After all the testing is completed, you can bundle your code and upload it to an S3 bucket. Then create a job pointing to the ETL script in the S3 bucket, so that the job can be run and scheduled as well.
Please refer here and to setting up Zeppelin on Windows for help with setting up a local environment. You can use the dev instance provided by Glue, but you may incur additional costs for it (EC2 instance charges).
Once you have set up the Zeppelin notebook, you can copy the script (test.py) into the notebook and run it from Zeppelin.
According to AWS Glue FAQ:
Q: When should I use AWS Glue vs. Amazon EMR?
AWS Glue works on top of the Apache Spark environment to provide a
scale-out execution environment for your data transformation jobs. AWS
Glue infers, evolves, and monitors your ETL jobs to greatly simplify
the process of creating and maintaining jobs. Amazon EMR provides you
with direct access to your Hadoop environment, affording you
lower-level access and greater flexibility in using tools beyond
Spark.
Do you have any specific requirement to run the Glue script on an EMR instance? In my opinion, EMR gives more flexibility: you can use any third-party Python libraries and run them directly on an EMR Spark cluster.
Regards
I have been assigned the task of automating the fetching of some files (CSV/Excel) from certain websites and loading them to S3. I would like to know whether a script can be written so that, given the URL and the S3 bucket path, running it loads the files onto S3; and if so, how would I go about it?
Open to alternative solutions as well.
Thanks!
You can write a simple bash script, or any other language of your choice. Below are the steps for bash script:
Download the csv file using curl or wget.
Upload using aws-cli tools, aws s3 cp
Make sure you have aws-cli installed on the machine where you will be running this script.
You may want to read more about AWS CLI and AWS CLI S3 commands.
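The same two steps can be sketched in Python if you prefer it to bash. The assumptions here: the file name is taken from the last path segment of the URL, the download needs no authentication, and the upload shells out to aws s3 cp as above. The function names are made up for this example.

```python
import posixpath
import subprocess
import tempfile
import urllib.parse
import urllib.request
from pathlib import Path


def s3_destination(url, s3_prefix):
    """Derive the target S3 URI from the URL's file name and the given prefix."""
    path = urllib.parse.urlparse(url).path
    filename = posixpath.basename(urllib.parse.unquote(path))
    if not filename:
        raise ValueError(f"cannot derive a file name from {url!r}")
    return f"{s3_prefix.rstrip('/')}/{filename}"


def fetch_and_upload(url, s3_prefix):
    """Download the file to a temporary directory, then copy it to S3."""
    destination = s3_destination(url, s3_prefix)
    with tempfile.TemporaryDirectory() as tmp:
        local_path = Path(tmp) / posixpath.basename(destination)
        urllib.request.urlretrieve(url, str(local_path))
        subprocess.run(["aws", "s3", "cp", str(local_path), destination], check=True)
    return destination
```

Calling fetch_and_upload with a URL and a prefix such as s3://my-bucket/incoming (a placeholder) covers the "put in the URL and the S3 path, then run" workflow from the question.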
I have worked with a Cloudera box, where I put all my scripts on the edge node. I am new to EMR on AWS, so I need your suggestions.
What I have done:
1. I have logged into the master node over SSH using PuTTY.
2. Created folders where I put all my scripts.
I have read some articles about putting the scripts in S3, but is there any problem with the approach I have described?
Do I need to stand up an EC2 Linux instance where I can put these scripts and call EMR jobs from that EC2 box?
I need your view.
Sanjeeb
The approach you have taken is correct. We keep scripts on the EMR master node as well as in S3. The advantage of having them in S3 is that if the EMR cluster crashes, you still have the scripts. Additionally, if you are executing from multiple EMR clusters, having the script in S3 makes it easier to invoke it from S3 directly instead of copying it to each cluster.
You can invoke pig scripts from S3 using sh -c 'pig -f ..'
There is no point in having additional ec2 running just to invoke the jobs.
How are you calling your emr jobs?
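One way to invoke a Pig script stored in S3 without logging in to the master node at all is to submit it as an EMR step through the AWS CLI. This is not the method described above, just an alternative sketch; the cluster ID, step name, and script URI are placeholders, and the step name is assumed to contain no spaces or commas.

```python
import subprocess


def pig_step_command(cluster_id, script_uri, step_name="pig-from-s3"):
    """argv for 'aws emr add-steps' that runs a Pig script stored in S3."""
    step = (
        f"Type=PIG,Name={step_name},ActionOnFailure=CONTINUE,"
        f"Args=[-f,{script_uri}]"
    )
    return ["aws", "emr", "add-steps", "--cluster-id", cluster_id, "--steps", step]


def submit_pig_step(cluster_id, script_uri, step_name="pig-from-s3"):
    """Submit the step to a running cluster; raises if the CLI call fails."""
    subprocess.run(pig_step_command(cluster_id, script_uri, step_name), check=True)
```

This can be run from any machine with the AWS CLI configured, so no extra EC2 box is needed just to trigger jobs.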