Monitoring a Spark cluster on AWS EMR without the Spark UI

I am running a Spark cluster on AWS EMR. How do I get all the details of the jobs and executors that are running on AWS EMR without using the Spark UI? I am going to use them for monitoring and optimization.

You can check out Nagios or Ganglia for cluster health, but you can't see the jobs running on Spark with these tools.

If you are using AWS EMR, you can do that from a terminal with the lynx text-mode browser, something like the steps below.
Log in to the master node of the cluster.
Run the command below:
lynx http://localhost:4040
Note: make sure a Spark job is running before you run the command, since port 4040 is only served while an application is active.
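
If a terminal browser is awkward for monitoring, the same port also serves Spark's JSON monitoring REST API under /api/v1. Below is a minimal Python sketch of polling it, assuming the driver is reachable at localhost:4040 from the master node (this may not hold in every deploy mode). The function names collect, fetch_json and summarize_executors are made up for this example, and the executor fields picked out are only a sample of what the API returns.

```python
import json
from urllib.request import urlopen

# Assumption: the driver of a running application serves the REST API
# on the same port as its UI (4040 by default) on this host.
BASE_URL = "http://localhost:4040/api/v1"


def fetch_json(url, opener=urlopen):
    """GET a URL and decode the JSON body. `opener` can be swapped for testing."""
    with opener(url) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize_executors(executors):
    """Keep a few executor fields that are handy for monitoring."""
    return [
        {
            "id": e.get("id"),
            "activeTasks": e.get("activeTasks", 0),
            "failedTasks": e.get("failedTasks", 0),
            "memoryUsed": e.get("memoryUsed", 0),
        }
        for e in executors
    ]


def collect(base_url=BASE_URL, fetch=fetch_json):
    """Return jobs and executor summaries for every application on this driver."""
    report = {}
    for app in fetch(f"{base_url}/applications"):
        app_id = app["id"]
        report[app_id] = {
            "name": app.get("name"),
            "jobs": fetch(f"{base_url}/applications/{app_id}/jobs"),
            "executors": summarize_executors(
                fetch(f"{base_url}/applications/{app_id}/executors")
            ),
        }
    return report
```

Calling collect() on the master node while a job is running returns a dictionary keyed by application id, which can be dumped with json.dumps or pushed to whatever monitoring system you use.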

Related

Is it possible to run Kubeflow pipelines or notebooks using AWS EMR as Spark master/driver?

I am trying to implement a solution on an EKS cluster where users/developers are expected to submit jobs through the Kubeflow central dashboard. To offer Spark as a service to users on the platform, I tried a standalone Spark installation on the EKS cluster, but then all the other configuration has to be managed by an admin. So the managed service EMR could possibly be used here as an independent service that is triggered only when a job is submitted.
I am trying to make EMR on EC2 or EMR on EKS available as an endpoint to be used in Kubeflow notebooks or pipelines. I have tried various things but could not find a robust solution.
If anybody has experience with this, please feel free to drop in your suggestions.

AWS EMR Spark Application Logging to CloudWatch

It is not clear to me whether application logging inside a Spark app itself, running on AWS EMR and executed via spark-shell or Steps, will end up in CloudWatch Logs for reporting if the CloudWatch Agent is installed on the EMR cluster.
Will it or not?

Is there a way to kill a Hive job without killing the AWS EMR cluster?

I use an AWS EMR cluster to run Hive queries. For query optimization purposes, I sometimes need to kill a long-running step but keep the EMR cluster alive so I can keep using it. Is there a way to do this from either the Hive CLI or the AWS console?
The Amazon EMR documentation covers this in detail. To cancel steps using the AWS CLI:
aws emr cancel-steps --cluster-id j-2QUAJ7T3OTEI8 --step-ids s-3M8DKCZYYN1QE
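
For scripting the same call, here is a minimal Python sketch that wraps the CLI command above with subprocess. It assumes the AWS CLI is installed and configured with credentials. The helper names build_cancel_command and cancel_steps are made up for this example. The optional --step-cancellation-option flag (SEND_INTERRUPT or TERMINATE_PROCESS) is a CLI option, but whether a step that is already running can be cancelled depends on the EMR release, so treat that part as something to verify on your cluster.

```python
import subprocess


def build_cancel_command(cluster_id, step_ids, option=None):
    """Build the argument list for `aws emr cancel-steps`."""
    cmd = ["aws", "emr", "cancel-steps",
           "--cluster-id", cluster_id,
           "--step-ids", *step_ids]
    if option is not None:
        # SEND_INTERRUPT or TERMINATE_PROCESS
        cmd += ["--step-cancellation-option", option]
    return cmd


def cancel_steps(cluster_id, step_ids, option=None):
    """Run the command and return the CLI's output; raises if the CLI fails."""
    result = subprocess.run(
        build_cancel_command(cluster_id, step_ids, option),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

For example, cancel_steps("j-2QUAJ7T3OTEI8", ["s-3M8DKCZYYN1QE"]) issues the same request as the command above and leaves the cluster itself running.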

Set up AWS Data Pipeline on a long-running EMR cluster

If I have a long-running EMR cluster and later want to set up a Data Pipeline that does something on that cluster, how can I do it?
Must I install Task Runner on this EMR cluster? Or will Task Runner be preinstalled? Or is there some other simple way?
Task Runner does not come preinstalled on EMR. It has to be configured manually; follow these steps to install Task Runner on the EMR cluster.
When starting the Task Runner process, provide a name for --workerGroup. This name identifies the EMR cluster and can be used in the WorkerGroup field of Data Pipeline activities.

What is a suitable configuration for setting up a 2-3 node Hadoop cluster on AWS?

What is a suitable configuration for setting up a 2-3 node Hadoop cluster on AWS?
I want to set up Hive, HBase, Solr and Tomcat on the Hadoop cluster for the purpose of doing small POCs.
Also, please suggest whether to go with EMR or to set up the cluster manually on EC2.
Amazon EMR can deploy a multi-node cluster with Hadoop and various applications (e.g. Hive, HBase) within a few minutes. It is much easier to deploy and manage than trying to run your own Hadoop cluster on Amazon EC2.
See: Getting Started: Analyzing Big Data with Amazon EMR