Yarn in Aazon EC2 with whirr - amazon-web-services

Yarn in Aazon EC2 with whirr - amazon-web-services

i am trying to configue Yarn 2.2.0 with whirr in Amazon EC2. however I am having some problems. I have modified the whirr services to support yarn 2.2.0. As a result I am able to start the jobs and run them successfully. however I am facing n issue in tracking the job progress.
mapreduce.Job (Job.java:monitorAndPrintJob(1317)) - Running job: job_1397996350238_0001
2014-04-20 21:57:24,544 INFO [main] mapred.ClientServiceDelegate (ClientServiceDelegate.java:getProxy(270)) - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
java.io.IOException: Job status not available
at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:322)
at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:599)
at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1327)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1289)
at com.zetaris.hadoop.seek.preprocess.PreProcessorDriver.executeJobs(PreProcessorDriver.java:112)
at com.zetaris.hadoop.seek.JobToJobMatchingDriver.executePreProcessJob(JobToJobMatchingDriver.java:143)
at com.zetaris.hadoop.seek.JobToJobMatchingDriver.executeJobs(JobToJobMatchingDriver.java:78)
at com.zetaris.hadoop.seek.JobToJobMatchingDriver.executeJobs(JobToJobMatchingDriver.java:43)
at com.zetaris.hadoop.seek.JobToJobMatchingDriver.main(JobToJobMatchingDriver.java:56)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212
I tried Debugginh the problem is with the ApplicationMaster. It has an hostname and rpc port , in which the hostname is the internal hostname which can only be resolved from within the amazon network. Idealy it should have been a public Amazon DNs name. however I could'nt set it yet. I tried setting parameters like
yarn.nodemanager.hostname
yarn.nodemanager.address
But I couldnt find any change in the ApplicationMaster's hostname or port they are still the private amazon internal hostname. Am I missing anything. Or should I change the /etc/hosts in all node manager nodes so that node managers start with the public address..
But that will be an overkill right.Or is there any way I can configure the ApplicationMaster to take the public ip.So that I can Remotely track the progress
I am doing this all because I need to submit the jobs remotely.I am not willing to compromise this feature. Anyone out there who an guide me
I was successful in configuring the historyserver and I am able to access then from the remote client. I used the configuration to do it.
mapreduce.jobhistory.webapp.address
When i debugged I find the
MRClientProtocol MRClientProxy = null;
try {
MRClientProxy = getProxy();
return methodOb.invoke(MRClientProxy, args);
} catch (InvocationTargetException e) {
// Will not throw out YarnException anymore
LOG.debug("Failed to contact AM/History for job " + jobId +
" retrying..", e.getTargetException());
// Force reconnection by setting the proxy to null.
realProxy = null;
proxy failing to connect because of the private address . And above code snipped is from ClientServiceDelegate

I was able to avoid the issue. Rather than solve this. The problem is with the resolution of ip outside the cloud environment.
Initially I tried updating the whirr-yarn source to make use of public ip for configurations rather than private ip. But still There where issues.So I gave up the task.
What I finally did was to start job form the cloud environment itself. rther than from a host outside the cloud infrastructure. Hope somebody found a better way.

I had the same problem. Solved by adding following lines in mapred-site.yml. It move's your staging directory from default tmp directory to your home directory where your have permission.
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/user</value>
</property>
In addition to this, you need to create a history directory on hdfs:
hdfs dfs -mkdir -p /user/history
hdfs dfs -chmod -R 1777 /user/history
hdfs dfs -chown mapred:hadoop /user/history
I found this link quite useful for configuring a Hadoop cluster.

conf.set("mapreduce.jobhistory.address", "hadoop3.hwdomain:10020");
conf.set("mapreduce.jobhistory.intermediate-done-dir", "/mr-history/tmp");
conf.set("mapreduce.jobhistory.done-dir", "/mr-history/done");

Related

AWS Airflow v2.0.2 doesn't show Google Cloud connection type

I want to load data from Google Storage to S3
To do this I want to use GoogleCloudStorageToS3Operator, which requires gcp_conn_id
So, I need to set up Google Cloud connection type
To do this, I added
apache-airflow[google]==2.0.2
to requirements.txt
but Google Cloud connection type is still not in Dropdown list of connections in MWAA
Same approach works well with mwaa local runner
https://github.com/aws/aws-mwaa-local-runner
I guess it does not work in MWAA because of security reasons discussed here
https://lists.apache.org/thread.html/r67dca5845c48cec4c0b3c34c3584f7c759a0b010172b94d75b3188a3%40%3Cdev.airflow.apache.org%3E
But still, is there any workaround to add Google Cloud connection type in MWAA?

Connections can be created and managed using either the UI or environment variables.
To my understanding the limitation that MWAA have over installation of some provider packages are only on the web server machine which is why the connections are not listed on the UI. This doesn't mean you can't create the connection at all, it just means you can't do it from the UI.
You can define it from CLI:
airflow connections add [-h] [--conn-description CONN_DESCRIPTION]
[--conn-extra CONN_EXTRA] [--conn-host CONN_HOST]
[--conn-login CONN_LOGIN]
[--conn-password CONN_PASSWORD]
[--conn-port CONN_PORT] [--conn-schema CONN_SCHEMA]
[--conn-type CONN_TYPE] [--conn-uri CONN_URI]
conn_id
You can also generate a connection URI to make it easier to set.
Connections can also be set as environment variable. Example:
export AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT='google-cloud-platform://?extra__google_cloud_platform__key_path=%2Fkeys%2Fkey.json&extra__google_cloud_platform__scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform&extra__google_cloud_platform__project=airflow&extra__google_cloud_platform__num_retries=5'
If needed you can check the google provider package docs to review the configuration options of the connection.

For MWAA there are 2 options to set connection:
Setting environment variable.
Using pattern AIRFLOW_CONN_YOUR_CONNECTION_NAME,
where e.g. YOUR_CONNECTION_NAME = GOOGLE_CLOUD_DEFAULT.
That can be done using custom plugin
https://docs.aws.amazon.com/mwaa/latest/userguide/samples-env-variables.html
Using secret manager
https://docs.aws.amazon.com/mwaa/latest/userguide/connections-secrets-manager.html
Tested for google cloud connection, both are working.

I asked AWS support about this issue. Looks like they are working on it.
They told me a way to configure the the google cloud platform connection passing a json object in the extras with Conn Type as HTTP. And it works.
I have validated editing google_cloud_default (Airflow > Admin > Connections)
Conn Type: HTTP
Extra:
{
"extra__google_cloud_platform__project":"<YOUR_VALUE>",
"extra__google_cloud_platform__key_path":"",
"extra__google_cloud_platform__keyfile_dict":"{"type": "service_account","project_id": "<YOUR_VALUE>","private_key_id": "<YOUR_VALUE>", "private_key": "-----BEGIN PRIVATE KEY-----\n<YOUR_VALUE>\n-----END PRIVATE KEY-----\n", "client_email": "<YOUR_VALUE>", "client_id": "<YOUR_VALUE>", "auth_uri": "https://<YOUR_VALUE>", "token_uri": "https://<YOUR_VALUE>", "auth_provider_x509_cert_url": "https://<YOUR_VALUE>", "client_x509_cert_url": "https://<YOUR_VALUE>"}",
"extra__google_cloud_platform__scope":"",
"extra__google_cloud_platform__num_retries":"5"
}
airflow conn screenshot
!! You must escape the " and /n in extra__google_cloud_platform__keyfile_dict !!
In requirements.txt I used:
apache-airflow[gcp]==2.0.2
(I believe apache-airflow[google]==2.0.2 should work as well)

eks fluent-bit to elasticsearch timeout

So I had a working configuration with fluent-bit on eks and elasticsearch on AWS that was pointing on the AWS elasticsearch service but for cost saving purpose, we deleted that elasticsearch and created an instance with a solo elasticsearch, enough for dev purpose. And the aws service doesn't manage well with only one instance.
The issue is that during this migration the fluent-bit seems to have broken, and I get lots of "[warn] failed to flush chunk" and some "[error] [upstream] connection #55 to ES-SERVER:9200 timed out after 10 seconds".
My current configuration:
[FILTER]
Name kubernetes
Match kube.*
Kube_URL https://kubernetes.default.svc:443
Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
Kube_Tag_Prefix kube.var.log.containers.
Merge_Log On
Merge_Log_Key log_processed
K8S-Logging.Parser On
K8S-Logging.Exclude Off
[INPUT]
Name tail
Tag kube.*
Path /var/log/containers/*.log
Parser docker
DB /var/log/flb_kube.db
Mem_Buf_Limit 50MB
Skip_Long_Lines On
Refresh_Interval 10
Ignore_Older 1m
I think the issue is in one of those configuration, if I comment the kubernetes filter I don't have the errors anymore but I'm loosing the fields in the indices...
I tried tweeking some parameters in fluent-bit to no avail, if anyone has a suggestion?
So, the previous logs did not indicate anything, but I finaly found something when activating trace_error in the elasticsearch output:
{"index":{"_index":"fluent-bit-2021.04.16","_type":"_doc","_id":"Xkxy 23gBidvuDr8mzw8W","status":400,"error":{"type":"mapper_parsing_exception","reas on":"object mapping for [kubernetes.labels.app] tried to parse field [app] as o bject, but found a concrete value"}}
Did someone get that error before and knows how to solve it?

So, after looking into the logs and finding the mapping issue I ssem to have resolved the issue. The logs are now corretly parsed and send to the elasticsearch.
To resolve it I had to augment the limit of output retry and add the Replace_Dots option.
[OUTPUT]
Name es
Match *
Host ELASTICSERVER
Port 9200
Index <fluent-bit-{now/d}>
Retry_Limit 20
Replace_Dots On
It seems that at the beginning I had issues with the content being sent, because of that the error seemed to have continued after the changed until a new index was created making me think that the error was still not resolved.

EndpointNotFound: public endpoint for hpext:dns service in RegionOne region not found

I have installed designate client on the same box where designate server is running with OpenStack Juno. After setting environment by issuing . .venv/bin/activate and keystone variables by issuing this command keystonerc_admin.
When I try to run designate --debug server-list command I am getting this error:
EndpointNotFound: public endpoint for hpext:dns service in RegionOne region not found
Please help me out.

I was able to solve this. This was my mistake. I have exported hpext:dns in keystone_admin and .bashrc file.
This value is very much specific if anyone is using hp cloud and they are logging into their geos.

Yes, that value is from before designate was an incubated project, but was running in HP Cloud.
The standard 'dns' service should be used for anyone not using the HP Public Cloud service (it is the default in python-designateclient, so you shouldn't have to do anything)

HiveMQ AWS cluster not starting

When trying to set up a HiveMQ (1.4.1) Cluster on AWS (using the bundles examples/cluster.xml file) the server upstart fails with an error:
Unable to load class for protocol pbcast.STREAMING_STATE_TRANSFER
There's really no more info on why this would fail.

The problem was that the examples/cluster.xml file used the old syntax for STREAMING_STATE_TRANSFER, while it should use just STATE_TRANSFER instead.
Hope that helps someone else. :)

Enabling HA namenodes on a secure cluster in Cloudera Manager fails

I am running a CDH4.1.2 secure cluster and it works fine with the single namenode+secondarynamenode configuration, but when I try to enable High Availability (quorum based) from the Cloudera Manager interface it dies at step 10 of 16, "Starting the NameNode that will be transitioned to active mode namenode ([my namenode's hostname])".
Digging into the role log file gives the following fatal error:
Exception in namenode joinjava.lang.IllegalArgumentException: Does not contain a valid host:port authority: [my namenode's fqhn]:[my namenode's fqhn]:0 at
org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:206) at
org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:158) at
org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:147) at
org.apache.hadoop.hdfs.server.namenode.NameNodeHttpServer.start(NameNodeHttpServer.java:143) at
org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:547) at
org.apache.hadoop.hdfs.server.namenode.NameNode.startCommonServices(NameNode.java:480) at
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:443) at
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:608) at
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:589) at
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1140) at
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1204)
How can I resolve this?

It looks like you have two problems:
The NameNode's IP address is resolving to "my namenode's fqhn" instead of a regular hostname. Check your /etc/hosts file to fix this.
You need to configure dfs.https.port. With Cloudera Manager free edition, you must have had to add the appropriate configs to the safety valves to enable security. As part of that, you need to configure the dfs.https.port.
Given that this code path is traversed even in the non-HA mode, I'm surprised that you were able to get your secure NameNode to start up correctly before enabling HA. In case you haven't already, I recommend that you first enable security, test that all HDFS roles start up correctly and then enable HA.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Yarn in Aazon EC2 with whirr - amazon-web-services

conf.set("mapreduce.jobhistory.address", "hadoop3.hwdomain:10020"); conf.set("mapreduce.jobhistory.intermediate-done-dir", "/mr-history/tmp"); conf.set("mapreduce.jobhistory.done-dir", "/mr-history/done");

Related

AWS Airflow v2.0.2 doesn't show Google Cloud connection type

eks fluent-bit to elasticsearch timeout

EndpointNotFound: public endpoint for hpext:dns service in RegionOne region not found

HiveMQ AWS cluster not starting

Enabling HA namenodes on a secure cluster in Cloudera Manager fails

Categories

Resources