Is there any way to have a single history server showing Spark applications running on different EMR clusters?
According to this link - https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-cluster-application-history.html
The section "Off-cluster access to persistent application user interfaces" states that the persistent application UIs run off-cluster, but can they be configured so that every Spark application (running on different clusters) is pointed to a single application UI? Or is it cluster-specific only?
I have tried figuring it out from the AWS docs but can't find anything relevant. Any reference/suggestion will be appreciated.
Thanks.
Can we run an application that is configured to run on a multi-node AWS EC2 K8s cluster using kops (project link) on a local Kubernetes cluster (set up using kubeadm)?
My thinking is that if the application runs in a K8s cluster based on AWS EC2 instances, it should also run in a local K8s cluster. I am trying it locally for testing purposes.
Here's what I have tried so far, but it is not working.
First, I set up my local 2-node cluster using kubeadm.
Then I modified the installation script of the project (link given above) by removing all references to EC2 (as I am using local machines) and to kops state (particularly in their create_cluster.py script).
I modified their application YAML files (app requirements) to match my local 2-node setup.
Unfortunately, although most of the application pods are created and in the Running state, some other pods fail to be created, and therefore I am not able to run the whole application on my local cluster.
I appreciate your help.
That is the beauty of Docker and Kubernetes: they help keep your development environment matching production. For simple applications, written without custom resources, you can deploy the same workload to any cluster running on any cloud provider.
However, the ability to deploy the same workload to different clusters depends on several factors, such as:
How do you manage authorization and authentication in your cluster? For example, IAM, IRSA.
Are you using any cloud-native custom resources? For example, AWS ALBs used as LoadBalancer Services.
Are you using any cloud-native storage? For example, pods that rely on EFS/EBS volumes.
Is your application cloud-agnostic? For example, does it use native technologies like Neptune?
Can you mock cloud technologies locally? For example, using LocalStack to mock Kinesis and DynamoDB.
How do you resolve DNS routes? For example, say you are using RDS on AWS: you can access it via a Route 53 entry. Locally you might be running a MySQL instance, and you need a DNS mechanism to discover that instance.
I did a Google search and looked at the kOps documentation. I could not find any info about deploying locally; it only supports public cloud providers.
IMO, you need to figure out a way to set up a local equivalent of your EKS cluster, and wherever cloud-native technologies are used, you need to find an alternative way of doing the same locally.
The true answer, as Rajan Panneer Selvam said in his response, is that it depends, but I'd like to expand somewhat on his answer by saying that your application should run on any K8S cluster given that it provides the services that the application consumes. What you're doing is considered good practice to ensure that your application is portable, which is always a factor in non-trivial applications where simply upgrading a downstream service could be considered a change of environment/platform requiring portability (platform-independence).
To help you achieve this, you should be developing a 12-Factor Application (12-FA) or one of its more up-to-date derivatives (12-FA is getting a little dated now and many variations have been suggested, but mostly they're all good).
For example, if your application uses a database then it should use DB-independent SQL or NoSQL so that you can switch it out. In production, you may run on Oracle, but in your local environment you may use MySQL: your application should not care. The credentials and connection string should be passed to the application via the usual K8s techniques of Secrets and ConfigMaps to help you achieve this. And all logging should be sent to stdout (and stderr) so that you can use a log-shipping agent to send the logs somewhere more useful than a local filesystem.
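To make that concrete, here is a minimal sketch of the pattern, assuming hypothetical DB_HOST / DB_USER / DB_PASSWORD variables that Kubernetes would inject from a Secret, with logging going straight to stdout:

import logging
import os
import sys

# Hypothetical variable names; in K8s these would be injected from Secrets/ConfigMaps.
db_host = os.environ["DB_HOST"]
db_user = os.environ["DB_USER"]
db_password = os.environ["DB_PASSWORD"]

# Log to stdout only and let a log-shipping agent take it from there.
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.info("connecting to database at %s as %s", db_host, db_user)

The application code stays identical across environments; only the injected values change.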
If you run your app locally then you have to provide a surrogate for every 'platform' service that is provided in production, and this may mean switching out major components of what you consider to be your application but this is ok, it is meant to happen. You provide a platform that provides services to your application-layer. Switching from EC2 to local may mean reconfiguring the ingress controller to work without the ELB, or it may mean configuring kubernetes secrets to use local-storage for dev creds rather than AWS KMS. It may mean reconfiguring your persistent volume classes to use local storage rather than EBS. All of this is expected and right.
What you should not have to do is start editing microservices to work in the new environment. If you find yourself doing that then the application has made a factoring and layering error. Platform services should be provided to a set of microservices that use them, the microservices should not be aware of the implementation details of these services.
Of course, it is possible that you have some non-portable code in your system; for example, you may be using some Oracle-specific PL/SQL that can't be run elsewhere. This code should be extracted to config files and equivalents provided for each database you wish to run on. This isn't always possible, in which case you should abstract as much as possible into isolated services, and you'll have to reimplement only those services on each new platform. That could still be time-consuming, but it is ultimately worth the effort for most non-trivial systems.
I want to understand whether it's possible to use a Flask application connected to the Spark master node of an Amazon EMR cluster. The goal is to call Flask from a web app to retrieve Spark outputs. The ports are open in the Amazon EMR cluster's security group, but I can't reach it from outside on its port.
What do you think about it? Are there any other solutions?
While it is totally possible to call Flask (or anything else) running on EMR, depending on what you are doing you might find Apache Livy handy. The good thing is that Livy is fully supported by EMR. You can use Livy to submit jobs and to retrieve results synchronously or asynchronously. It gives you a REST API to interact with Spark.
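As a rough sketch of what that looks like (the master DNS name, S3 paths, and arguments below are placeholders, and it assumes Livy's default port 8998 is reachable from where you run it):

import requests

# Placeholder endpoint: Livy listens on port 8998 on the EMR master node by default.
LIVY_URL = "http://EMR-MASTER-PUBLIC-DNS:8998"

# Submit a PySpark script stored on S3 as a Livy batch.
payload = {
    "file": "s3://my-bucket/jobs/my_spark_job.py",
    "args": ["--date", "2020-01-01"],
}
resp = requests.post(LIVY_URL + "/batches", json=payload)
batch_id = resp.json()["id"]

# Poll the batch state ("starting", "running", "success", "dead", ...).
state = requests.get("{}/batches/{}/state".format(LIVY_URL, batch_id)).json()["state"]
print(batch_id, state)

Your Flask (or any other web) layer can wrap calls like these instead of talking to the Spark master directly.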
Context
I run spark applications on an Amazon EMR cluster.
These applications are orchestrated by YARN.
I didn't define yarn.nodemanager.log-dirs, spark.yarn.historyServer.address or other configurations.
In the Application history tab there is information about 'steps'.
I looked at the official documentation (View Application History), but didn't find information about where the "Application history" is stored.
Question
I would like to know where the "Application history" is stored by default.
Thanks in advance.
UPD
The main problem is that the cluster is already shut down, so I can't look at the master node.
I am evaluating Stackdriver from GCP for logging across multiple microservices.
Some of these services are deployed on premises and some of them are on AWS/GCP.
Our services are either .NET or Node.js based apps, and we are invested in winston for Node.js and NLog in .NET.
I was looking at integrating our on-premises Node.js application with Stackdriver Logging. Looking at the documentation at https://cloud.google.com/logging/docs/setup/nodejs, it seems that we need to install the agent on any machine other than Google Compute instances. Is this correct?
If we do need to install the agent, is there any way I can test the logging during development? The development environment is either Windows 10 or Mac.
There's a new option for ingesting logs (and metrics) with Stackdriver, as most of the non-Google environment agents look like they are being deprecated: https://cloud.google.com/stackdriver/docs/deprecations/third-party-apps
There is a Google post on logging on-prem resources with Stackdriver and Blue Medora:
https://cloud.google.com/solutions/logging-on-premises-resources-with-stackdriver-and-blue-medora
For logs, you still need to install an agent on each box to collect them; it's a BindPlane agent, not a Google agent.
For Node.js, you can use the @google-cloud/logging-winston and @google-cloud/logging-bunyan modules from anywhere (on-prem, AWS, GCP, etc.). You will need to provide the projectId and auth credentials manually if not running on GCP. Instructions on how to set these up are available in the linked pages.
When running on GCP we figure out the exact environment (App Engine, Compute Engine, etc.) automatically and the logs should show up under those resources in the Logging UI. If you are going to use the modules from your development machines, we will report the logs against the 'global' resource by default. You can customize this by passing a specific resource descriptor yourself.
Let us know if you run into any trouble.
I tried setting this up on my local K8s cluster by following this: https://kubernetes.io/docs/tasks/debug-application-cluster/logging-stackdriver/
But I couldn't get it to work; the fluentd-gcp-v2.0-qhqzt pod keeps crashing.
Also, the page mentions that there are multiple issues with Stackdriver Logging if you DON'T use it on Google GKE. See the screenshot.
I think Google is trying to lock you into GKE.
Unlike Hortonworks or Cloudera, AWS EMR does not seem to provide any GUI to change the XML configurations of the various Hadoop ecosystem frameworks.
Logging into my EMR namenode and doing a quick
find / -iname yarn-site.xml
I was able to find yarn-site.xml at /etc/hadoop/conf.empty/yarn-site.xml and capacity-scheduler.xml at /etc/hadoop/conf.empty/capacity-scheduler.xml.
But note that these are under conf.empty, and I suspect these might not be the actual locations of the yarn-site and capacity-scheduler XMLs.
I understand that I can change these configurations while creating a cluster, but what I need to know is how to change them without tearing down the cluster.
I just want to play around with scheduling properties and such, and try out different schedulers to identify what might work well with my Spark applications.
Thanks in advance!
Well, yarn-site.xml and capacity-scheduler.xml are indeed in the correct location (/etc/hadoop/conf.empty/), and on a running cluster, editing them on the master node and restarting the YARN ResourceManager daemon will change the scheduler.
When spinning up a new cluster, you can use the EMR Configurations API to set the appropriate values: http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
For example: specify the appropriate values in the capacity-scheduler and yarn-site classifications in your Configuration for EMR to change those values in the corresponding XML files.
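A minimal boto3 sketch of passing such classifications at cluster creation (the cluster name, instance types, and the specific properties below are just examples):

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# All names, instance types, and property values below are examples only.
response = emr.run_job_flow(
    Name="scheduler-experiments",
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    Configurations=[
        {
            "Classification": "capacity-scheduler",
            "Properties": {
                "yarn.scheduler.capacity.resource-calculator":
                    "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
            },
        },
        {
            "Classification": "yarn-site",
            "Properties": {"yarn.nodemanager.vmem-check-enabled": "false"},
        },
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])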
Edit (Sep 4, 2019):
With Amazon EMR version 5.21.0 and later, you can override cluster configurations and specify additional configuration classifications for each instance group in a running cluster. You do this by using the Amazon EMR console, the AWS Command Line Interface (AWS CLI), or the AWS SDK.
Please see
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps-running-cluster.html
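A hedged boto3 sketch of that running-cluster reconfiguration flow (the cluster ID and the property shown are placeholders, and it assumes the Configurations field of modify_instance_groups documented for EMR 5.21.0 and later):

import boto3

emr = boto3.client("emr", region_name="us-east-1")
cluster_id = "j-XXXXXXXXXXXXX"  # placeholder cluster ID

# Pick the instance group whose configuration should be overridden.
groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
master_group = next(g for g in groups if g["InstanceGroupType"] == "MASTER")

# Override the capacity-scheduler classification on the running cluster (EMR 5.21.0+).
emr.modify_instance_groups(
    ClusterId=cluster_id,
    InstanceGroups=[
        {
            "InstanceGroupId": master_group["Id"],
            "Configurations": [
                {
                    "Classification": "capacity-scheduler",
                    "Properties": {
                        "yarn.scheduler.capacity.maximum-am-resource-percent": "0.5"
                    },
                }
            ],
        }
    ],
)

EMR then applies the new classification values to that instance group, restarting the affected applications as needed.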