Eclipse Ditto throws akka exception after deployment on EKS - akka
I'm running Eclipse Ditto v2.5.0 on EKS (Helm chart), and after a couple of days the service stops working: it returns no results, and persisting new things fails as well. I found the following in the logs:
2022-06-28T08:06:12+02:00 Caused by: akka.stream.RemoteStreamRefActorTerminatedException: [SourceRef-139] Remote partner [Actor[akka://ditto-cluster@10.20.87.204:2551/system/Materializers/StreamSupervisor-0/$$q2c-SinkRef-139#-1677314214]] has terminated unexpectedly and no clean completion/failure message was received (possible reasons: network partition or subscription timeout triggered termination of partner). Tearing down.
2022-06-28T08:06:12+02:00 at akka.stream.impl.streamref.SourceRefStageImpl$$anon$1.onTimer(SourceRefImpl.scala:374)
2022-06-28T08:06:12+02:00 at akka.stream.stage.TimerGraphStageLogic.onInternalTimer(GraphStage.scala:1665)
2022-06-28T08:06:12+02:00 at akka.stream.stage.TimerGraphStageLogic.$anonfun$getTimerAsyncCallback$1(GraphStage.scala:1654)
2022-06-28T08:06:12+02:00 at akka.stream.stage.TimerGraphStageLogic.$anonfun$getTimerAsyncCallback$1$adapted(GraphStage.scala:1654)
2022-06-28T08:06:12+02:00 at akka.stream.impl.fusing.GraphInterpreter.runAsyncInput(GraphInterpreter.scala:467)
2022-06-28T08:06:12+02:00 at akka.stream.impl.fusing.GraphInterpreterShell$AsyncInput.execute(ActorGraphInterpreter.scala:517)
2022-06-28T08:06:12+02:00 at akka.stream.impl.fusing.GraphInterpreterShell.processEvent(ActorGraphInterpreter.scala:625)
2022-06-28T08:06:12+02:00 at akka.stream.impl.fusing.ActorGraphInterpreter.akka$stream$impl$fusing$ActorGraphInterpreter$$processEvent(ActorGraphInterpreter.scala:800)
2022-06-28T08:06:12+02:00 at akka.stream.impl.fusing.ActorGraphInterpreter$$anonfun$receive$1.applyOrElse(ActorGraphInterpreter.scala:818)
2022-06-28T08:06:12+02:00 at akka.actor.Actor.aroundReceive(Actor.scala:537)
2022-06-28T08:06:12+02:00 at akka.actor.Actor.aroundReceive$(Actor.scala:535)
2022-06-28T08:06:12+02:00 at akka.stream.impl.fusing.ActorGraphInterpreter.aroundReceive(ActorGraphInterpreter.scala:716)
2022-06-28T08:06:12+02:00 ... 10 common frames omitted
2022-06-28T08:06:12+02:00 2022-06-28 08:06:12,408 ERROR [] o.e.d.i.u.a.ThingsAggregatorProxyActor akka://ditto-cluster/user/gatewayRoot/proxy/aggregatorProxy - [retrieve-thing-response] Upstream failed.
2022-06-28T08:06:12+02:00 akka.stream.RemoteStreamRefActorTerminatedException: [SourceRef-137] Remote partner [Actor[akka://ditto-cluster@10.20.87.204:2551/system/Materializers/StreamSupervisor-0/$$m2c-SinkRef-137#934810721]] has terminated unexpectedly and no clean completion/failure message was received (possible reasons: network partition or subscription timeout triggered termination of partner). Tearing down.
2022-06-28T08:06:12+02:00 at akka.stream.impl.streamref.SourceRefStageImpl$$anon$1.onTimer(SourceRefImpl.scala:374)
2022-06-28T08:06:12+02:00 at akka.stream.stage.TimerGraphStageLogic.onInternalTimer(GraphStage.scala:1665)
2022-06-28T08:06:12+02:00 at akka.stream.stage.TimerGraphStageLogic.$anonfun$getTimerAsyncCallback$1(GraphStage.scala:1654)
2022-06-28T08:06:12+02:00 at akka.stream.stage.TimerGraphStageLogic.$anonfun$getTimerAsyncCallback$1$adapted(GraphStage.scala:1654)
2022-06-28T08:06:12+02:00 at akka.stream.impl.fusing.GraphInterpreter.runAsyncInput(GraphInterpreter.scala:467)
2022-06-28T08:06:12+02:00 at akka.stream.impl.fusing.GraphInterpreterShell$AsyncInput.execute(ActorGraphInterpreter.scala:517)
2022-06-28T08:06:12+02:00 at akka.stream.impl.fusing.GraphInterpreterShell.processEvent(ActorGraphInterpreter.scala:625)
2022-06-28T08:06:12+02:00 at akka.stream.impl.fusing.ActorGraphInterpreter.akka$stream$impl$fusing$ActorGraphInterpreter$$processEvent(ActorGraphInterpreter.scala:800)
2022-06-28T08:06:12+02:00 at akka.stream.impl.fusing.ActorGraphInterpreter$$anonfun$receive$1.applyOrElse(ActorGraphInterpreter.scala:818)
2022-06-28T08:06:12+02:00 at akka.actor.Actor.aroundReceive(Actor.scala:537)
2022-06-28T08:06:12+02:00 at akka.actor.Actor.aroundReceive$(Actor.scala:535)
2022-06-28T08:06:12+02:00 at akka.stream.impl.fusing.ActorGraphInterpreter.aroundReceive(ActorGraphInterpreter.scala:716)
2022-06-28T08:06:12+02:00 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580)
2022-06-28T08:06:12+02:00 at akka.actor.ActorCell.invoke(ActorCell.scala:548)
2022-06-28T08:06:12+02:00 at akka.stream.impl.fusing.ActorGraphInterpreter.akka$stream$impl$fusing$ActorGraphInterpreter$$processEvent(ActorGraphInterpreter.scala:800)
2022-06-28T08:06:12+02:00 at akka.stream.impl.fusing.ActorGraphInterpreter$$anonfun$receive$1.applyOrElse(ActorGraphInterpreter.scala:818)
2022-06-28T08:06:12+02:00 at akka.actor.Actor.aroundReceive(Actor.scala:537)
2022-06-28T08:06:12+02:00 at akka.actor.Actor.aroundReceive$(Actor.scala:535)
2022-06-28T08:06:12+02:00 at akka.stream.impl.fusing.ActorGraphInterpreter.aroundReceive(ActorGraphInterpreter.scala:716)
2022-06-28T08:06:12+02:00 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580)
2022-06-28T08:06:12+02:00 at akka.actor.ActorCell.invoke(ActorCell.scala:548)
2022-06-28T08:06:12+02:00 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
2022-06-28T08:06:12+02:00 at akka.dispatch.Mailbox.run(Mailbox.scala:231)
2022-06-28T08:06:12+02:00 at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
2022-06-28T08:06:12+02:00 at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
2022-06-28T08:06:12+02:00 at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
2022-06-28T08:06:12+02:00 at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
2022-06-28T08:06:12+02:00 at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
2022-06-28T08:06:12+02:00 at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
2022-06-28T08:06:12+02:00 2022-06-28 08:06:12,410 ERROR [78dae9eb-4515-4513-9930-3060f7ba9652] o.e.d.g.s.e.a.HttpRequestActor akka://ditto-cluster/user/$Xe - Got <Status.Failure> when a command response was expected: <akka.stream.RemoteStreamRefActorTerminatedException: [SourceRef-137] Remote partner [Actor[akka://ditto-cluster@10.32.57.210:2551/system/Materializers/StreamSupervisor-0/$$m2c-SinkRef-137#934810721]] has terminated unexpectedly and no clean completion/failure message was received (possible reasons: network partition or subscription timeout triggered termination of partner). Tearing down.>!
2022-06-28T08:06:12+02:00 java.util.concurrent.CompletionException: akka.stream.RemoteStreamRefActorTerminatedException: [SourceRef-137] Remote partner [Actor[akka://ditto-cluster@10.32.57.210:2551/system/Materializers/StreamSupervisor-0/$$m2c-SinkRef-137#934810721]] has terminated unexpectedly and no clean completion/failure message was received (possible reasons: network partition or subscription timeout triggered termination of partner). Tearing down.
2022-06-28T08:06:12+02:00 at org.eclipse.ditto.gateway.service.endpoints.actors.AbstractHttpRequestActor.lambda$getResponseAwaitingBehavior$21(AbstractHttpRequestActor.java:387)
2022-06-28T08:06:12+02:00 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
2022-06-28T08:06:12+02:00 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
2022-06-28T08:06:12+02:00 at scala.PartialFunction.applyOrElse(PartialFunction.scala:214)
and
2022-06-27T16:22:19+02:00 2022-06-27 16:22:19,305 ERROR [] a.m.c.b.i.HttpContactPointBootstrap akka://ditto-cluster@12.25.88.222:2551/system/bootstrapCoordinator/contactPointProbe-10-20-68-87.ditto.pod.cluster.local-8558 - Overdue of probing-failure-timeout, stop probing, signaling that it's failed
How can I debug this and determine what the root cause might be?
The logs indicate that you tried to retrieve several things via HTTP.
The gateway service received this error, as the following frame shows:
2022-06-28T08:06:12+02:00 at org.eclipse.ditto.gateway.service.endpoints.actors.AbstractHttpRequestActor.lambda$getResponseAwaitingBehavior$21(AbstractHttpRequestActor.java:387)
The ThingsAggregatorProxyActor fetches each thing you requested from the things service in your EKS cluster. The exception says that its remote stream partner terminated without a clean completion, which points at that service or at the network between the pods.
I would check the Ditto health endpoint.
Assuming you use nginx in your EKS cluster, you should be able to call it as the devops user at localhost:30080/status/health.
If you aren't using nginx, call the gateway pod directly, for example: gateway:8080/status/health
Also check the logs of the things pod, and whether that pod was restarted or had any other issues.
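As a rough illustration of the health check suggested above, here is a minimal Python sketch that calls /status/health with basic auth and lists every component that is not reported as up. The response shape is an assumption: it treats the body as nested JSON objects with "label", "status" and "children" keys and "UP" as the healthy value, so adjust the key names if your response looks different. The function names and the password placeholder are made up for this example.

```python
import base64
import json
import urllib.request


def fetch_health(base_url, user, password, timeout=10.0):
    """GET <base_url>/status/health with HTTP basic auth and parse the JSON body.

    urlopen raises urllib.error.HTTPError for non-2xx responses.
    """
    request = urllib.request.Request(f"{base_url.rstrip('/')}/status/health")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def components_not_up(node, path=""):
    """Return (path, status) for every node whose status is not "UP".

    Assumption: each node is a dict with optional "label", "status"
    and "children" keys. This shape is not taken from the Ditto docs.
    """
    label = node.get("label", "")
    here = f"{path}/{label}" if label else path
    found = []
    status = node.get("status")
    if status is not None and status != "UP":
        found.append((here or "/", status))
    for child in node.get("children", []):
        found.extend(components_not_up(child, here))
    return found


if __name__ == "__main__":
    # Hypothetical call; replace the URL and credentials with your own.
    health = fetch_health("http://localhost:30080", "devops", "<devops-password>")
    for component, state in components_not_up(health):
        print(component, state)
```

If the list is empty while requests still fail, the pod restart counts and the things pod logs are the next place to look.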
Related
Tendermint : get error when start tendermint node as per the sample
I tried to create an application as per the guide in the Tendermint documentation. After I started the application and the tendermint node, I got the error below. I use go version go1.15.5 linux/amd64 and tendermint v0.34.0-rc4-148-g095e9cd.
[bc@localhost kvstore]$ TMHOME="/tmp/example" tendermint node --proxy_app=unix://example.sock
I[2020-12-01|13:16:59.697] Version info module=main software=v0.34.0-rc4-148-g095e9cd block=11 p2p=8
I[2020-12-01|13:16:59.702] Starting Node service module=main impl=Node
I[2020-12-01|13:16:59.703] Starting StateSync service module=statesync impl=StateSync
I[2020-12-01|13:16:59.738] Started node module=main nodeInfo="{ProtocolVersion:{P2P:8 Block:11 App:0} DefaultNodeID:3cf5ea6219c57fd906c042f767748988ba070db7 ListenAddr:tcp://0.0.0.0:26656 Network:test-chain-1mrgVg Version:v0.34.0-rc4-148-g095e9cd Channels:40202122233038606100 Moniker:localhost.localdomain Other:{TxIndex:on RPCAddress:tcp://127.0.0.1:26657}}"
E[2020-12-01|13:17:00.758] Stopping abci.socketClient for error: read message: EOF module=abci-client connection=consensus
E[2020-12-01|13:17:00.758] consensus connection terminated. Did the application crash? Please restart tendermint module=proxy err="read message: EOF"
E[2020-12-01|13:17:00.758] Error in proxyAppConn.BeginBlock module=state err="read message: EOF"
E[2020-12-01|13:17:00.758] Error on ApplyBlock module=consensus err="read message: EOF"
I[2020-12-01|13:17:00.758] captured terminated, exiting... module=main
I[2020-12-01|13:17:00.758] Stopping Node service module=main impl=Node
I[2020-12-01|13:17:00.758] Stopping Node module=main
I[2020-12-01|13:17:00.760] Stopping StateSync service module=statesync impl=StateSync
I[2020-12-01|13:17:00.760] Closing rpc listener module=main listener="&{Listener:0xc00000d440 sem:0xc000039200 closeOnce:{done:0 m:{state:0 sema:0}} done:0xc000039260}"
E[2020-12-01|13:17:00.760] Error serving server module=main err="accept tcp 127.0.0.1:26657: use of closed network connection"
KVStore
[bc@localhost kvstore]$ ./example
badger 2020/12/01 13:16:55 INFO: All 0 tables opened in 0s
badger 2020/12/01 13:16:55 INFO: Replaying file id: 0 at offset: 0
badger 2020/12/01 13:16:55 INFO: Replay took: 6.807µs
badger 2020/12/01 13:16:55 DEBUG: Value log discard stats empty
I[2020-12-01|13:16:55.373] Starting ABCIServer service impl=ABCIServer
I[2020-12-01|13:16:55.405] Waiting for new connection...
I[2020-12-01|13:16:59.692] Accepted a new connection
I[2020-12-01|13:16:59.692] Waiting for new connection...
I[2020-12-01|13:16:59.692] Accepted a new connection
I[2020-12-01|13:16:59.692] Waiting for new connection...
I[2020-12-01|13:16:59.692] Accepted a new connection
I[2020-12-01|13:16:59.692] Waiting for new connection...
I[2020-12-01|13:16:59.692] Accepted a new connection
I[2020-12-01|13:16:59.692] Waiting for new connection...
E[2020-12-01|13:17:00.758] Connection error err="error reading message: proto: wrong wireType = 2 for field Height"
E[2020-12-01|13:17:00.761] Connection was closed by client
E[2020-12-01|13:17:00.761] Connection was closed by client
E[2020-12-01|13:17:00.761] Connection was closed by client
go.mod
module github.com/me/example
go 1.15
require (
    github.com/dgraph-io/badger v1.6.2
    github.com/tendermint/tendermint v0.34.0-rc4
)
Debezium with AWS MSK NOT_ENOUGH_REPLICAS
I have a running Debezium cluster in AWS with no issues. I want to try AWS MSK, so I launched a cluster and then an EC2 instance to run my connectors.
Then I installed confluent-kafka:
sudo apt-get update && sudo apt-get install confluent-platform-2.12
By default, AWS MSK doesn't have a schema registry, so I configured one on the connector EC2 instance.
Schema registry conf file:
kafkastore.connection.url=z-1.bhuvi-XXXXXXXXX.amazonaws.com:2181,z-3.bhuvi-XXXXXXXXX.amazonaws.com:2181,z-2.bhuvi-XXXXXXXXX.amazonaws.com:2181
kafkastore.bootstrap.servers=PLAINTEXT://b-2.bhuvi-XXXXXXXXX.amazonaws.com:9092,PLAINTEXT://b-4.bhuvi-XXXXXXXXX.amazonaws.com:9092,PLAINTEXT://b-1.bhuvi-XXXXXXXXX.amazonaws.com:9092
Then the /etc/kafka/connect-distributed.properties file:
bootstrap.servers=b-4.bhuvi-XXXXXXXXX.amazonaws.com:9092,b-3.bhuvi-XXXXXXXXX.amazonaws.com:9092,b-2.bhuvi-XXXXXXXXX.amazonaws.com:9092
plugin.path=/usr/share/java,/usr/share/confluent-hub-components
Install the connector:
confluent-hub install debezium/debezium-connector-mysql:latest
Start the services:
systemctl start confluent-schema-registry
systemctl start confluent-connect-distributed
Now everything started. Then I created a mysql.json file:
{
  "name": "mysql-connector-db01",
  "config": {
    "name": "mysql-connector-db01",
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.server.id": "1",
    "tasks.max": "3",
    "database.history.kafka.bootstrap.servers": "172.31.47.152:9092,172.31.38.158:9092,172.31.46.207:9092",
    "database.history.kafka.topic": "schema-changes.mysql",
    "database.server.name": "mysql-db01",
    "database.hostname": "172.31.84.129",
    "database.port": "3306",
    "database.user": "bhuvi",
    "database.password": "my_stong_password",
    "database.whitelist": "proddb,test",
    "internal.key.converter.schemas.enable": "false",
    "key.converter.schemas.enable": "false",
    "internal.key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "internal.value.converter.schemas.enable": "false",
    "value.converter.schemas.enable": "false",
    "internal.value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.add.source.fields": "ts_ms"
  }
}
Create the Debezium connector:
curl -X POST -H "Accept: application/json" -H "Content-Type: application/json" http://localhost:8083/connectors -d @mysql.json
Then it started giving this error on the connector EC2 instance:
Dec 20 11:42:36 ip-172-31-44-220 connect-distributed[2630]: [2019-12-20 11:42:36,290] WARN [Producer clientId=producer-3] Got error produce response with correlation id 844 on topic-partition connect-configs-0, retrying (2147482809 attempts left). Error: NOT_ENOUGH_REPLICAS (org.apache.kafka.clients.producer.internals.Sender:637)
Dec 20 11:42:36 ip-172-31-44-220 connect-distributed[2630]: [2019-12-20 11:42:36,391] WARN [Producer clientId=producer-3] Got error produce response with correlation id 845 on topic-partition connect-configs-0, retrying (2147482808 attempts left). Error: NOT_ENOUGH_REPLICAS (org.apache.kafka.clients.producer.internals.Sender:637)
Dec 20 11:42:36 ip-172-31-44-220 connect-distributed[2630]: [2019-12-20 11:42:36,492] WARN [Producer clientId=producer-3] Got error produce response with correlation id 846 on topic-partition connect-configs-0, retrying (2147482807 attempts left). Error: NOT_ENOUGH_REPLICAS (org.apache.kafka.clients.producer.internals.Sender:637)
Dec 20 11:42:36 ip-172-31-44-220 connect-distributed[2630]: [2019-12-20 11:42:36,593] WARN [Producer clientId=producer-3] Got error produce response with correlation id 847 on topic-partition connect-configs-0, retrying (2147482806 attempts left). Error: NOT_ENOUGH_REPLICAS (org.apache.kafka.clients.producer.internals.Sender:637)
This error message never stops.
Describe output of connect-configs:
Topic:connect-configs PartitionCount:1 ReplicationFactor:1 Configs:cleanup.policy=compact
Topic: connect-configs Partition: 0 Leader: 2 Replicas: 2 Isr: 2
MSK sets min.insync.replicas to 2 for all topics by default (see https://docs.aws.amazon.com/msk/latest/developerguide/msk-default-configuration.html). It is possible that Kafka Connect is producing with acks=all and, since your topic has only one replica, a write can never reach the required number of in-sync replicas.
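To make the reasoning above concrete, here is a small Python sketch of the rule Kafka applies to a produce request sent with acks=all. It is a simplified model, not broker code, and the function names are made up for this example. The numbers in the usage lines come from the question: connect-configs has a replication factor of 1, and the MSK default for min.insync.replicas is 2.

```python
def produce_outcome(acks, isr_size, min_insync_replicas):
    """Simplified model of the check applied to a produce request.

    With acks=all, the write is rejected when fewer replicas are in sync
    than min.insync.replicas requires. Other acks settings skip the check.
    """
    if acks == "all" and isr_size < min_insync_replicas:
        return "NOT_ENOUGH_REPLICAS"
    return "OK"


def can_ever_succeed(replication_factor, min_insync_replicas):
    """The in-sync replica set can never be larger than the replication factor."""
    return replication_factor >= min_insync_replicas


# connect-configs as described in the question: one replica, one in-sync replica.
print(produce_outcome("all", isr_size=1, min_insync_replicas=2))
print(can_ever_succeed(replication_factor=1, min_insync_replicas=2))
```

Under this model, retrying cannot help: the topic either needs more replicas or a lower min.insync.replicas.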
Spark shuts down after 10 seconds of running
I'm trying to set up clusters in my AWS account (Amazon). I followed this tutorial to set it up. I ran into some problems regarding ports, but I finally got it to work, until it shut down after 10 seconds, giving me nothing more than this error:
16/05/12 12:52:46 INFO client.AppClient$ClientActor: Connecting to master spark://ip-to-my-machine:7077...
16/05/12 12:53:06 INFO client.AppClient$ClientActor: Connecting to master spark://ip-to-my-machine:7077...
16/05/12 12:53:26 ERROR cluster.SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
16/05/12 12:53:26 ERROR scheduler.TaskSchedulerImpl: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up.
This was the command I ran:
bin/spark-shell --master spark://ip-to-my-machine:7077
I opened TCP port 7077. What seems to be the problem?
Error with Phusion passenger + Nginx for spawning the new application process
My application is based on Ruby 2.2.0, Rails 4.1.9, nginx 1.8.0, and Phusion Passenger 5.0.11, and it is deployed on an EC2 instance with 2 cores and 8 GB RAM. Sometimes it shows the following error in the log file:
stderr: Errno::ENOMEM
App 13088 stderr: )
App 13088 stderr: from /usr/lib/ruby/vendor_ruby/phusion_passenger/preloader_shared_helpers.rb:69:in `accept_and_process_next_client'
App 13088 stderr: from /usr/lib/ruby/vendor_ruby/phusion_passenger/preloader_shared_helpers.rb:139:in `run_main_loop'
App 13088 stderr: from /usr/share/passenger/helper-scripts/rack-preloader.rb:154:in `<module:App>'
App 13088 stderr: from /usr/share/passenger/helper-scripts/rack-preloader.rb:29:in `<module:PhusionPassenger>'
App 13088 stderr: from /usr/share/passenger/helper-scripts/rack-preloader.rb:28:in `<main>'
[ 2015-12-28 15:16:56.6347 24663/7f161b3e6700 App/Implementation.cpp:303 ]: Could not spawn process for application /var/www/project123: An error occurred while starting the web application. It exited before signalling successful startup back to Phusion Passenger.
Error ID: 392ad659
Error details saved to: /tmp/passenger-error-fHmsZh.html
Message from application: An error occurred while starting the web application. It exited before signalling successful startup back to Phusion Passenger. Please read this article for more information about this problem.<br> <h2>Raw process output:</h2> (empty)
[ 2015-12-28 15:16:56.6348 24663/7f161b3e6700 Spa/SmartSpawner.h:726 ]: An error occurred while spawning a process: An error occurred while starting the web application. It exited before signalling successful startup back to Phusion Passenger.
[ 2015-12-28 15:16:56.6390 24663/7f161b3e6700 Spa/SmartSpawner.h:727 ]: The application preloader seems to have crashed, restarting it and trying again...
App 13205 stdout: Using: /home/ubuntu/.rvm/gems/ruby-2.2.0@found.fy
App 13205 stdout:
[ 2015-12-28 15:17:02.5000 24663/7f161b3e6700 App/Implementation.cpp:303 ]: Could not spawn process for application /var//www/project123: An error occurred while starting the web application. It exited before signalling successful startup back to Phusion Passenger.
Error ID: 63c3eb0c
Error details saved to: /tmp/passenger-error-62IksC.html
Message from application: An error occurred while starting the web application. It exited before signalling successful startup back to Phusion Passenger. Please read this article for more information about this problem.<br> <h2>Raw process output:</h2> (empty)
App 13205 stderr: /usr/lib/ruby/vendor_ruby/phusion_passenger/preloader_shared_helpers.rb:69:in `fork'
-- INSERT -- 840,1 4%
Then my sites stop working. Do you know why this error occurs? I also checked the memory usage, but found nothing that would cause a memory-related error.
Spark 0.9.0 standalone: connection refused
I am using Spark 0.9.0 in standalone mode. When I tried a streaming application in standalone mode, I got a connection refused exception. I added the hostname to /etc/hosts and also tried with the IP alone. In both cases the worker registered with the master without any issues. Is there a way to solve this issue?
14/02/28 07:15:01 INFO Master: akka.tcp://driverClient@127.0.0.1:55891 got disassociated, removing it.
14/02/28 07:15:04 INFO Master: Registering app Twitter Streaming
14/02/28 07:15:04 INFO Master: Registered app Twitter Streaming with ID app-20140228071504-0000
14/02/28 07:34:42 INFO Master: akka.tcp://spark@127.0.0.1:33688 got disassociated, removing it.
14/02/28 07:34:42 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.165.35.96%3A38903-6#-1146558090] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/02/28 07:34:42 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster@10.165.35.96:8910] -> [akka.tcp://spark@127.0.0.1:33688]: Error [Association failed with [akka.tcp://spark@127.0.0.1:33688]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@127.0.0.1:33688]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: /127.0.0.1:33688
I had a similar issue when running Spark in cluster mode. My problem was that the server was started with the hostname 'fluentd:7077' and not the FQDN. I edited /sbin/start-master.sh so that the --ip flag reflects how my remote nodes connect:
/usr/lib/jvm/jdk1.7.0_51/bin/java -cp :/home/vagrant/spark-0.9.0-incubating-bin-hadoop2/conf:/home/vagrant/spark-0.9.0-incubating-bin-hadoop2/assembly/target/scala-2.10/spark-assembly_2.10-0.9.0-incubating-hadoop2.2.0.jar -Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.master.Master --ip fluentd.alex.dev --port 7077 --webui-port 8080
Hope this helps.