Running an EMR Spark script, and the Spark UI SQL tab disappears - amazon-web-services

When running Spark SQL jobs in local mode, my Spark UI has a SQL tab.
When running the same job on AWS EMR, the Spark UI's SQL tab is no longer there.
I've SSH tunneled, set up the FoxyProxy settings, and can view the various EMR UIs, including the Spark UI, in the browser.
Is there a reason why the SQL tab would not be there in Amazon EMR's version of the Spark UI?

There are two different UIs for Spark on EMR: the Spark UI and the Spark History Server.
They look the same, but they are served from different URLs, and the History Server continues to be served even after an application completes, as long as the EMR cluster is still up and running.
As far as I know, the History Server does not have a SQL tab, so if the URL you are browsing is the History Server's rather than the live Spark UI's, that would explain the missing tab.
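For reference, here is a minimal sketch of the kind of Spark SQL job whose queries show up under the SQL tab of the live Spark UI while the application is running (on EMR the live UI is typically reached through the YARN ResourceManager's ApplicationMaster link, not the History Server URL). The app name and data are placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: SQL/DataFrame queries run through a SparkSession are what
// populate the SQL tab of the *live* Spark UI, not the History Server.
// All names and data below are placeholders.
object SqlTabCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sql-tab-check")
      .getOrCreate()

    import spark.implicits._

    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
    df.createOrReplaceTempView("t")

    // Each executed query appears as an entry under the SQL tab while the app runs.
    spark.sql("SELECT id, count(*) AS n FROM t GROUP BY id").show()

    // Keep the application alive for a few minutes so the live UI can be inspected.
    Thread.sleep(5 * 60 * 1000)
    spark.stop()
  }
}
```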

Related

Can't reach flask in Spark master node using Amazon EMR

I want to understand whether it's possible to use a Flask application connected to the Spark master node on Amazon EMR. The goal is to call Flask from a web app to retrieve Spark outputs. The ports are open in the Amazon EMR cluster's security group, but I can't reach the application from outside on its port.
What do you think? Are there any other solutions?
While it is totally possible to call Flask (or anything else) running on EMR, depending on what you are doing you might find Apache Livy handy. The good thing is that Livy is fully supported by EMR. You can use Livy to submit jobs and retrieve results synchronously or asynchronously; it gives you a REST API for interacting with Spark.
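For example, here is a minimal sketch of submitting a batch job through Livy's REST API, assuming Livy is running on the EMR master node at its default port 8998; the host name, S3 jar path, and class name are placeholders.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Minimal sketch: submit a Spark batch job through Livy's REST API.
// Assumes Livy on the EMR master at its default port 8998; the host,
// S3 path, and class name are placeholders.
object LivySubmit {
  private val livyUrl = "http://<emr-master-dns>:8998"
  private val client  = HttpClient.newHttpClient()

  def main(args: Array[String]): Unit = {
    val payload =
      """{
        |  "file": "s3://my-bucket/jars/my-spark-job.jar",
        |  "className": "com.example.MySparkJob"
        |}""".stripMargin

    val submit = HttpRequest.newBuilder(URI.create(s"$livyUrl/batches"))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(payload))
      .build()

    val response = client.send(submit, HttpResponse.BodyHandlers.ofString())
    println(s"Livy replied: ${response.body()}") // contains the batch id and state

    // A web app (or the Flask service) would then poll GET /batches/{id} for the
    // state and GET /batches/{id}/log for driver output.
  }
}
```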

How to view AWS Glue Spark UI

In my Glue job, I have enabled the Spark UI and specified all the necessary details (S3 locations, etc.) needed for the Spark UI to work.
How can I view the DAG/Spark UI of my Glue job?
You need to set up an EC2 instance that can host the Spark History Server.
The documentation below links to CloudFormation templates that you can use:
https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html
You can access the History Server via the EC2 instance (port 18080 by default). You need to configure the networking and ports accordingly.
EDIT: There is also an option to set up the Spark UI locally. This requires downloading the Docker image from the aws-glue-samples repo and setting the AWS credentials and S3 location there. This server consumes the event log files that the Glue job generates; the files are about 4 MB each.

AWS EMR Presto job

Is it possible to submit Presto jobs/steps to an EMR cluster in any way, just as you can submit Hive jobs/steps via a script in S3?
I would rather not SSH into the instance to execute the commands, but do it automatically.
You can use any JDBC/ODBC client to connect to the Presto cluster. If you want to connect programmatically, you can do it with the available drivers:
https://prestodb.github.io/docs/current/installation/jdbc.html
https://teradata.github.io/presto/docs/0.167-t/installation/odbc.html
If you have a BI tool like Tableau, QlikView, or Superset, it can easily be done as well, e.g. https://onlinehelp.tableau.com/current/pro/desktop/en-us/examples_presto.htm
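For the programmatic route, a minimal sketch using the Presto JDBC driver might look like the following. It assumes presto-jdbc is on the classpath and that the EMR Presto coordinator is listening on its usual port 8889; the host, catalog, schema, and table names are placeholders.

```scala
import java.sql.DriverManager
import java.util.Properties

// Minimal sketch: query Presto on EMR programmatically via its JDBC driver
// (com.facebook.presto:presto-jdbc on the classpath). The host is a placeholder,
// and 8889 is the port EMR typically uses for the Presto coordinator.
object PrestoQuery {
  def main(args: Array[String]): Unit = {
    val url   = "jdbc:presto://<emr-master-dns>:8889/hive/default"
    val props = new Properties()
    props.setProperty("user", "hadoop")

    val connection = DriverManager.getConnection(url, props)
    try {
      val statement = connection.createStatement()
      val rs = statement.executeQuery("SELECT * FROM my_table LIMIT 10")
      while (rs.next()) {
        println(rs.getString(1))
      }
    } finally {
      connection.close()
    }
  }
}
```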

Spark HBase to Google Dataproc and Bigtable migration

I have an HBase Spark job running on an AWS EMR cluster. Recently we moved to GCP, and I transferred all the HBase data to Bigtable. Now I am running the same Spark Java/Scala job on Dataproc, and the Spark job is failing because it is looking for the spark.hbase.zookeeper.quorum setting.
Please let me know how I can make my Spark job run successfully with Bigtable without code changes.
Regards,
Neeraj Verma
While Bigtable shares the same principles as HBase and offers the same Java API, it does not share HBase's wire protocol, so the standard HBase client won't work (the ZooKeeper error suggests you are trying to connect to Bigtable via the HBase client). Instead, you need to modify your program to use the Bigtable-specific client. It implements the same Java interfaces as HBase, but requires the Google client jars on the classpath and a few property overrides to enable it.
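As a minimal sketch of that change, assuming the bigtable-hbase client artifact is on the classpath and using placeholder project, instance, and table IDs:

```scala
import com.google.cloud.bigtable.hbase.BigtableConfiguration
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Get

// Minimal sketch: open a connection to Bigtable through the HBase-compatible
// bigtable-hbase client. The project, instance, table, and row key are placeholders.
// The Connection/Table objects are the familiar HBase interfaces, so the rest of
// the job's read/write code can stay as-is.
object BigtableRead {
  def main(args: Array[String]): Unit = {
    val connection = BigtableConfiguration.connect("my-gcp-project", "my-bigtable-instance")
    try {
      val table  = connection.getTable(TableName.valueOf("my-table"))
      val result = table.get(new Get("some-row-key".getBytes("UTF-8")))
      println(s"Row empty? ${result.isEmpty}")
    } finally {
      connection.close()
    }
  }
}
```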

Apache Spark: List all the Sparks job running on the cluster

Is there a command that lists all the Spark jobs running on the cluster?
I am new to this technology, and we have multiple users running spark-submit jobs on an AWS cluster. Is there a way to list all the running Spark jobs?
Thank you!
Use the Spark REST API. It can be invoked from the active Spark Web UI or from the History Server. Of course, as cricket_007 said, you can also list jobs in the UI. These UIs and REST services run on all cluster types.
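For example, here is a minimal sketch of listing applications through the Spark monitoring REST API, assuming the History Server's default port 18080 and a placeholder host; point it at the live driver UI's address instead to see currently running applications.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Minimal sketch: list applications known to a Spark UI or History Server via the
// monitoring REST API. The host is a placeholder; 18080 is the History Server default.
object ListSparkApps {
  def main(args: Array[String]): Unit = {
    val uri     = URI.create("http://<spark-history-server>:18080/api/v1/applications")
    val request = HttpRequest.newBuilder(uri).GET().build()
    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())

    // The response is a JSON array of applications (ids, names, attempt states);
    // /api/v1/applications/{app-id}/jobs drills down into the jobs of one application.
    println(response.body())
  }
}
```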