How to view AWS Glue Spark UI - amazon-web-services

In my Glue job, I have enabled Spark UI and specified all the necessary details (s3 related etc.) needed for Spark UI to work.
How can I view the DAG/Spark UI of my Glue job?

You need to setup an ec2 instance that can host the history server.
The below documentation has links to CloudFormation templates that you can use.
https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html
You can access the history server via the ec2 instance(default on 18080). You need to configure the networks and ports suitably.
EDIT - There is also an option to setup SparkUI locally. This requires downloading the docker image from aws-glue-samples repo amd settin the AWS credential and s3 location there. This server consummes the files that the glue job generates. The files are about 4MB large.

Related

Using AWS Lambda to copying S3 files to on-premise LAN folder

Problem:
We need to perform a task under which we have to transfer all files ( CSV format) stored in AWS S3 bucket to a on-premise LAN folder using the Lambda functions. This will be a scheduled tasks which will be carried out after every 1 hour, and the file will again be transferred from S3 to on-premise LAN folder while replacing the existing ones. Size of these files is not large (preferably under few MBs).
I am not able to find out any AWS managed service to accomplish this task.
I am a newbie to AWS, any solution to this problem is most welcome.
Thanks,
Actually, I am looking for a solution by which I can push S3 files to on-premise folder automatically
For that you need to make the on-premise network visible to the logic (lambda, whatever..) "pushing" the content. The default solution is using the AWS site-to-site VPN.
There are multiple options for setting up the VPN, you could choose based on the needs.
Then the on-premise network will look just like another subnet.
However - VPN has its complexity and cost. In most of the cases it is much easier to "pull" data from the on-premise environment.
To sync data there are multiple options. For a managed service, I could point out the S3 Gateway which based on your description sounds like an insane overkill.
Maybe you could start with a simple cron job (or a task timer if working with windows) and run a CLI command to sync the S3 content or just copy specified files.
Check out S3 Sync, I think it will help you accomplish this task: https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/sync.html#examples
To run any AWS CLI in your computer, you will need to setup credentials, and the setup account/roles should have permissions to do the task (e.g. access S3)
Check out AWS CLI setup here: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html

Where to put application property file for spark application running on AWS EMR

I am submitting one spark application job jar to EMR, and it is using some property file. So I can put it into S3 and while creating the EMR I can download it and copy it at some location in EMR box if this is the best way how I can do this while creating the EMR cluster itself at bootstrapping time.
Check following snapshot
In edit software setting you can add your own configuration or JSON file ( which stored on S3 location ) and using this setting you can passed configure parameter to EMR cluster on creating time. For more details please check following links
Amazon EMR Cluster Configurations
Configuring Applications
AWS ClI
hope this will help you.

How to run glue script from Glue Dev Endpoint

I have a glue script (test.py) written say in a editor. I connected to glue dev endpoint and copied the script to endpoint or I can store in S3 bucket. Basically glue endpoint is an EMR cluster, now how can I run the script from the dev endpoint terminal? Can I use spark-submit and run it ?
I know we can run it from glue console,but more interested to know if I can run it from glue end point terminal.
You don't need a notebook; you can ssh to the dev endpoint and run it with the gluepython interpreter (not plain python).
e.g.
radix#localhost:~$ DEV_ENDPOINT=glue#ec2-w-x-y-z.compute-1.amazonaws.com
radix#localhost:~$ scp myscript.py $DEV_ENDPOINT:/home/glue/myscript.py
radix#localhost:~$ ssh -i {private-key} $DEV_ENDPOINT
...
[glue#ip-w-x-y-z ~]$ gluepython myscript.py
You can also run the script directly without getting an interactive shell with ssh (of course, after uploading the script with scp or whatever):
radix#localhost:~$ ssh -i {private-key} $DEV_ENDPOINT gluepython myscript.py
If this is a script that uses the Job class (as the auto-generated Python scripts do), you may need to pass --JOB_NAME and --TempDir parameters.
For development / testing purpose, you can setup a zeppelin notebook locally, have an SSH connection established using the AWS Glue endpoint URL, so you can have access to the data catalog/crawlers,etc. and also the s3 bucket where your data resides.
After all the testing is completed, you can bundle your code, upload to an S3 bucket. Then create a Job pointing to the ETL script in S3 bucket, so that the job can be run, and scheduled as well.
Please refer here and setting up zeppelin on windows, for any help on setting up local environment. You can use dev instance provided by Glue, but you may incur additional costs for the same(EC2 instance charges).
Once you set up the zeppelin notebook, you can copy the script(test.py) to the zeppelin notebook, and run from the zeppelin.
According to AWS Glue FAQ:
Q: When should I use AWS Glue vs. Amazon EMR?
AWS Glue works on top of the Apache Spark environment to provide a
scale-out execution environment for your data transformation jobs. AWS
Glue infers, evolves, and monitors your ETL jobs to greatly simplify
the process of creating and maintaining jobs. Amazon EMR provides you
with direct access to your Hadoop environment, affording you
lower-level access and greater flexibility in using tools beyond
Spark.
Do you have any specific requirement to run Glue script in an EMR instance? Since in my opinion, EMR gives more flexibility and you can use any 3rd party python libraries and run directly in a EMR Spark cluster.
Regards

best practise for logging on Elastic Map Reduce - AWS

I am planning to use Amazon EMR for spark streaming application. Amazon provides a nice interface to show stderr & controller log. But for streaming application I am not sure how manage the logs.
Amazon logs the data to /var/log/hadoop/steps/<step-id> and similar places for spark. http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-manage-view-web-log-files.html
I was wondering on how do we rotate logs and still be accessible via the aws emr web interface. We can easily change the log rotation policy by configuring the hadoop-log4j, but that way I cannot access it via the web interface. Also EMR should manage the log s3 upload
AWS EMR also stores the logs in S3.
Navigate to your cluster console for the running cluster, and in the left middle column, you'll see the path to the s3 bucket.
Be careful not to reuse the same s3 bucket path for future clusters, otherwise you could overwrite your log data.

Setting hive properties in Amazon EMR?

I'm trying to run a Hive query using Amazon EMR, and am trying to get Apache Tez to work with it too, which from what I understand requires setting the hive.execution.engine property to tez according to the hive site?
I get that hive properties can be set with set hive.{...} usually, or in the hive-site.xml, but I don't know how either of those interact with / are possible to do in Amazon EMR.
So: is there a way to set Hive Configuration Properties in Amazon EMR, and if so, how?
Thanks!
You can do this in two ways:
1) DIRECTLY WITHIN SINGLE HIVE SCRIPT (.hql file)
Just put your properties at the beginning of your Hive hql script, like:
set hive.execution.engine=tez;
CREATE TABLE...
2) VIA APPLICATION CONFIGURATIONS
When you create a EMR cluster, you can specify Hive configurations that work for the entire cluster's lifetime. This can be made either via AWS Management Console, or via AWS CLI.
a) AWS Management Console
Open AWS EMR service and click on Create cluster button
Click on Go to advanced options at the top
Be sure to select Hive among the applications, then enter a JSON configuration like below, where you can find all properties you usually have in hive-site xml configuration, I highlighted the TEZ property as example. You can optionally load the JSON from a S3 path.
b) AWS CLI
As stated in detail here, you can specify the Hive configuration on cluster creation, using the flag --configurations, like below:
aws emr create-cluster --configurations file://configurations.json --release-label emr-5.9.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large --auto-terminate
The JSON file has the same content shown above in the Management Console example.
Again, you can optionally specify a S3 path instead:
--configurations https://s3.amazonaws.com/myBucket/configurations.json
Amazon Elastic MapReduce (EMR) is an automated means of deploying a normal Hadoop distribution. Commands you can normally run against Hadoop and Hive will also work under EMR.
You can execute hive commands either interactively (by logging into the Master node) or via scripts (submitted as job 'steps').
You would be responsible for installing TEZ on Amazon EMR. I found this forum post: TEZ on EMR