I have been running a storm-0.8.2 cluster for over a year now. Last night AWS restarted the supervisor machines. I have tried to restart the supervisor processes manually, but on startup I receive this error message in the logs:
2014-10-15 19:48:04 supervisor [ERROR] Error on initialization of server mk-supervisor
java.net.UnknownHostException: domU-<aws internal ip>: domU-<aws internal ip>
at java.net.InetAddress.getLocalHost(InetAddress.java:1454)
at backtype.storm.util$local_hostname.invoke(util.clj:153)
at backtype.storm.daemon.supervisor$supervisor_data.invoke(supervisor.clj:181)
at backtype.storm.daemon.supervisor$fn__4729$exec_fn__1200__auto____4730.invoke(supervisor.clj:331)
at clojure.lang.AFn.applyToHelper(AFn.java:167)
at clojure.lang.AFn.applyTo(AFn.java:151)
at clojure.core$apply.invoke(core.clj:601)
at backtype.storm.daemon.supervisor$fn__4729$mk_supervisor__4754.doInvoke(supervisor.clj:327)
at clojure.lang.RestFn.invoke(RestFn.java:436)
at backtype.storm.daemon.supervisor$_launch.invoke(supervisor.clj:477)
at backtype.storm.daemon.supervisor$_main.invoke(supervisor.clj:506)
at clojure.lang.AFn.applyToHelper(AFn.java:159)
at clojure.lang.AFn.applyTo(AFn.java:151)
at backtype.storm.daemon.supervisor.main(Unknown Source)
I am not a Clojure expert, but it looks like, on line 215 of backtype.storm.daemon.supervisor.clj, it is possible to set the local hostname in a config file:
215 :my-hostname (if (contains? conf STORM-LOCAL-HOSTNAME)
216 (conf STORM-LOCAL-HOSTNAME)
217 (local-hostname))
Is this possible? Which file do I need to set this in, and what is the correct key for this setting?
Or am I way off base and do I need to do something else to get my workers to restart?
I haven't faced that situation before, but if I were you, I would try:
Clear the directories used by Storm (the ones you configure in conf/storm.yaml).
If the previous step doesn't solve the issue, try mapping IPs to hostnames in your OS hosts file, as in the example below.
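For example (a sketch; the IP and hostname below are placeholders for your node's actual values):
# check which name the JVM is trying to resolve
hostname
getent hosts "$(hostname)"   # prints nothing if the name has no entry
# then map it in /etc/hosts, e.g.:
10.0.0.12   domU-10-0-0-12.compute-1.internal domU-10-0-0-12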
I got help from the user mailing list (user AT storm DOT apache DOT org). You can set the local hostname in your conf/storm.yaml file using the key "storm.local.hostname".
Add the entry below to your storm.yaml file:
storm.local.hostname: "localhost"
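Note that "localhost" only really makes sense on a single-node setup; on a multi-node cluster each supervisor's storm.yaml should carry a name or IP that the other nodes can resolve, e.g. (example address):
storm.local.hostname: "10.0.0.12"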
Related
I encountered a problem while working with the NetworkManager service on my Raspberry Pi 4. I am running it with openSUSE Tumbleweed JeOS, so the default is not NetworkManager but wicked, and I started to notice that every time I tried to restart NetworkManager the service failed.
Looking through the logs, I noticed this:
Oct 31 16:49:37 suse-server NetworkManager[2966]: <warn> [1635698977.7227] config: Default config file invalid: /etc/NetworkManager/NetworkManager.conf: Key file contains line “404: Not Found” which is not a key-value pair, group, or comment
Basically, every time NetworkManager is restarted, something overwrites the config file, probably with a predefined template from GitHub (when you open an invalid link on raw.github.com you get "404: Not Found"). I temporarily switched back to wicked, but I need to resolve this problem because I need to run HASS.IO, which unfortunately supports only NetworkManager.
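One way to catch whatever is rewriting the file (a sketch, assuming the audit subsystem is available on your image) is to put a watch on it and inspect the audit log after the next restart:
# log writes and attribute changes to the config file
auditctl -w /etc/NetworkManager/NetworkManager.conf -p wa -k nmconf
# after NetworkManager has been restarted and the file changed:
ausearch -k nmconf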
So I had a working configuration with Fluent Bit on EKS and Elasticsearch on AWS, pointing at the AWS Elasticsearch service; but for cost-saving purposes we deleted that Elasticsearch and created an instance with a standalone Elasticsearch, enough for dev purposes (the AWS service doesn't cope well with only one instance).
The issue is that during this migration Fluent Bit seems to have broken, and I get lots of "[warn] failed to flush chunk" and some "[error] [upstream] connection #55 to ES-SERVER:9200 timed out after 10 seconds".
My current configuration:
[FILTER]
Name kubernetes
Match kube.*
Kube_URL https://kubernetes.default.svc:443
Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
Kube_Tag_Prefix kube.var.log.containers.
Merge_Log On
Merge_Log_Key log_processed
K8S-Logging.Parser On
K8S-Logging.Exclude Off
[INPUT]
Name tail
Tag kube.*
Path /var/log/containers/*.log
Parser docker
DB /var/log/flb_kube.db
Mem_Buf_Limit 50MB
Skip_Long_Lines On
Refresh_Interval 10
Ignore_Older 1m
I think the issue is in one of these configurations: if I comment out the kubernetes filter I don't get the errors anymore, but I lose the fields in the indices...
I tried tweaking some parameters in Fluent Bit to no avail. Does anyone have a suggestion?
So, the previous logs did not indicate anything, but I finally found something when activating trace_error in the elasticsearch output:
{"index":{"_index":"fluent-bit-2021.04.16","_type":"_doc","_id":"Xkxy 23gBidvuDr8mzw8W","status":400,"error":{"type":"mapper_parsing_exception","reas on":"object mapping for [kubernetes.labels.app] tried to parse field [app] as o bject, but found a concrete value"}}
Has anyone seen that error before and knows how to solve it?
So, after looking into the logs and finding the mapping issue, I seem to have resolved it. The logs are now correctly parsed and sent to Elasticsearch.
To resolve it I had to raise the output retry limit and add the Replace_Dots option (which replaces dots in key names with underscores, so labels such as app.kubernetes.io/name no longer force [kubernetes.labels.app] to be mapped as an object).
[OUTPUT]
Name es
Match *
Host ELASTICSERVER
Port 9200
Index <fluent-bit-{now/d}>
Retry_Limit 20
Replace_Dots On
It seems that at the beginning I had issues with the content being sent; because of that, the error appeared to persist after the change until a new index was created, making me think the problem was still not resolved.
I am getting this error on my GetHDFS and I don't understand why: I set the Hadoop Configuration Resources, Kerberos Principal, and Kerberos Keytab, and there are files in the path; I just checked via SuperPuTTY and it's a valid path.
Currently the GetHDFS is just linked to a LogAttribute, as I am trying to get each step working before moving on to the next.
Overall process: GetHDFS -> PutEmail. I am trying to print out a count of the rows of the file at that path (CSV).
It looks like your core-site.xml has a default filesystem that references a hostname 'gwhdpdevnnha' that is not reachable from where NiFi is running. You can troubleshoot this outside of NiFi by checking whether you can ping that hostname from the terminal.
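For example (a sketch; 'gwhdpdevnnha' is the hostname from the error, and the core-site.xml path is whatever you listed in Hadoop Configuration Resources):
ping gwhdpdevnnha
# if that fails, check which authority the default filesystem points at:
grep -A1 'fs.default' /path/to/core-site.xml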
I am trying to configure YARN 2.2.0 with Whirr on Amazon EC2; however, I am having some problems. I have modified the Whirr services to support YARN 2.2.0, and as a result I am able to start jobs and run them successfully. However, I am facing an issue in tracking job progress.
mapreduce.Job (Job.java:monitorAndPrintJob(1317)) - Running job: job_1397996350238_0001
2014-04-20 21:57:24,544 INFO [main] mapred.ClientServiceDelegate (ClientServiceDelegate.java:getProxy(270)) - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
java.io.IOException: Job status not available
at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:322)
at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:599)
at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1327)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1289)
at com.zetaris.hadoop.seek.preprocess.PreProcessorDriver.executeJobs(PreProcessorDriver.java:112)
at com.zetaris.hadoop.seek.JobToJobMatchingDriver.executePreProcessJob(JobToJobMatchingDriver.java:143)
at com.zetaris.hadoop.seek.JobToJobMatchingDriver.executeJobs(JobToJobMatchingDriver.java:78)
at com.zetaris.hadoop.seek.JobToJobMatchingDriver.executeJobs(JobToJobMatchingDriver.java:43)
at com.zetaris.hadoop.seek.JobToJobMatchingDriver.main(JobToJobMatchingDriver.java:56)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
I tried debugging; the problem is with the ApplicationMaster. It has a hostname and RPC port, in which the hostname is the internal hostname, which can only be resolved from within the Amazon network. Ideally it should have been a public Amazon DNS name, but I couldn't set that yet. I tried setting parameters like:
yarn.nodemanager.hostname
yarn.nodemanager.address
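In yarn-site.xml, what I tried looked roughly like this (the public DNS name and port are just examples):
<property>
  <name>yarn.nodemanager.hostname</name>
  <value>ec2-xx-xx-xx-xx.compute-1.amazonaws.com</value>
</property>
<property>
  <name>yarn.nodemanager.address</name>
  <value>ec2-xx-xx-xx-xx.compute-1.amazonaws.com:8041</value>
</property>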
But I couldn't find any change in the ApplicationMaster's hostname or port; they are still the private Amazon-internal hostname. Am I missing anything? Or should I change /etc/hosts on all NodeManager nodes so that the NodeManagers start with the public address?
But that would be overkill, right? Or is there any way I can configure the ApplicationMaster to take the public IP, so that I can remotely track the progress?
I am doing all this because I need to submit the jobs remotely, and I am not willing to compromise on that feature. Can anyone out there guide me?
I was successful in configuring the history server, and I am able to access it from the remote client. I used this configuration property to do it:
mapreduce.jobhistory.webapp.address
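In mapred-site.xml that looks like the following (the host is an example; 19888 is the usual default port for the history server web UI):
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>ec2-xx-xx-xx-xx.compute-1.amazonaws.com:19888</value>
</property>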
When I debugged, I found the
MRClientProtocol MRClientProxy = null;
try {
  MRClientProxy = getProxy();
  return methodOb.invoke(MRClientProxy, args);
} catch (InvocationTargetException e) {
  // Will not throw out YarnException anymore
  LOG.debug("Failed to contact AM/History for job " + jobId +
      " retrying..", e.getTargetException());
  // Force reconnection by setting the proxy to null.
  realProxy = null;
proxy failing to connect because of the private address. (The code snippet above is from ClientServiceDelegate.)
I was able to work around the issue rather than solve it. The problem is with the resolution of the IP outside the cloud environment.
Initially I tried updating the whirr-yarn source to make use of the public IP for configurations rather than the private IP, but there were still issues, so I gave up on that approach.
What I finally did was to start the job from the cloud environment itself, rather than from a host outside the cloud infrastructure. I hope somebody finds a better way.
I had the same problem. I solved it by adding the following lines to mapred-site.xml. This moves your staging directory from the default tmp directory to your home directory, where you have permission.
<property>
  <name>yarn.app.mapreduce.am.staging-dir</name>
  <value>/user</value>
</property>
In addition to this, you need to create a history directory on hdfs:
hdfs dfs -mkdir -p /user/history
hdfs dfs -chmod -R 1777 /user/history
hdfs dfs -chown mapred:hadoop /user/history
I found this link quite useful for configuring a Hadoop cluster.
conf.set("mapreduce.jobhistory.address", "hadoop3.hwdomain:10020");
conf.set("mapreduce.jobhistory.intermediate-done-dir", "/mr-history/tmp");
conf.set("mapreduce.jobhistory.done-dir", "/mr-history/done");
I am running a CDH4.1.2 secure cluster and it works fine with the single namenode+secondarynamenode configuration, but when I try to enable High Availability (quorum based) from the Cloudera Manager interface it dies at step 10 of 16, "Starting the NameNode that will be transitioned to active mode namenode ([my namenode's hostname])".
Digging into the role log file gives the following fatal error:
Exception in namenode join
java.lang.IllegalArgumentException: Does not contain a valid host:port authority: [my namenode's fqhn]:[my namenode's fqhn]:0
    at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:206)
    at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:158)
    at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:147)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeHttpServer.start(NameNodeHttpServer.java:143)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:547)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.startCommonServices(NameNode.java:480)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:443)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:608)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:589)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1140)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1204)
How can I resolve this?
It looks like you have two problems:
The NameNode's IP address is resolving to "my namenode's fqhn" instead of a regular hostname. Check your /etc/hosts file to fix this.
You need to configure dfs.https.port. With Cloudera Manager Free Edition, you must have added the appropriate configs to the safety valves to enable security; as part of that, you also need to configure dfs.https.port, as in the example below.
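For example, in the hdfs-site.xml safety valve (50470 is the commonly used default secure HTTP port; adjust to your environment):
<property>
  <name>dfs.https.port</name>
  <value>50470</value>
</property>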
Given that this code path is traversed even in the non-HA mode, I'm surprised that you were able to get your secure NameNode to start up correctly before enabling HA. In case you haven't already, I recommend that you first enable security, test that all HDFS roles start up correctly and then enable HA.