istio upgrade from 1.4.6 -> 1.5.0 throws istiod errors: remote error: tls: error decrypting message - istio

Just upgraded Istio from 1.4.6 (Helm) to Istio 1.5.0 (istioctl) [purged Istio and reinstalled with istioctl], but the istiod logs keep throwing the following:
2020-03-16T18:25:45.209055Z info grpc: Server.Serve failed to complete security handshake from "10.150.56.111:56870": remote error: tls: error decrypting message
2020-03-16T18:25:46.792447Z info grpc: Server.Serve failed to complete security handshake from "10.150.57.112:49162": remote error: tls: error decrypting message
2020-03-16T18:25:46.930483Z info grpc: Server.Serve failed to complete security handshake from "10.150.56.160:36878": remote error: tls: error decrypting message
2020-03-16T18:25:48.284122Z info grpc: Server.Serve failed to complete security handshake from "10.150.52.230:44758": remote error: tls: error decrypting message
2020-03-16T18:25:48.288180Z info grpc: Server.Serve failed to complete security handshake from "10.150.57.149:56756": remote error: tls: error decrypting message
2020-03-16T18:25:49.108515Z info grpc: Server.Serve failed to complete security handshake from "10.150.57.151:53970": remote error: tls: error decrypting message
2020-03-16T18:25:49.111874Z info Handling event update for pod contentgatewayaidest-7f4694d87-qmq8z in namespace djin-content -> 10.150.53.50
2020-03-16T18:25:49.519861Z info grpc: Server.Serve failed to complete security handshake from "10.150.57.91:59510": remote error: tls: error decrypting message
2020-03-16T18:25:50.133664Z info grpc: Server.Serve failed to complete security handshake from "10.150.57.203:59726": remote error: tls: error decrypting message
2020-03-16T18:25:50.331020Z info grpc: Server.Serve failed to complete security handshake from "10.150.57.195:59970": remote error: tls: error decrypting message
2020-03-16T18:25:52.110695Z info Handling event update for pod contentgateway-d74b44c7-dtdxs in namespace djin-content -> 10.150.56.215
2020-03-16T18:25:53.312761Z info Handling event update for pod dysonpriority-b6dbc589b-mk628 in namespace djin-content -> 10.150.52.91
2020-03-16T18:25:53.496524Z info grpc: Server.Serve failed to complete security handshake from "10.150.56.111:57276": remote error: tls: error decrypting message
This also means no sidecars launch successfully; they fail with:
2020-03-16T18:32:17.265394Z info Envoy proxy is NOT ready: config not received from Pilot (is Pilot running?): cds updates: 16 successful, 0 rejected; lds updates: 0 successful, 0 rejected
2020-03-16T18:32:19.269334Z info Envoy proxy is NOT ready: config not received from Pilot (is Pilot running?): cds updates: 16 successful, 0 rejected; lds updates: 0 successful, 0 rejected
2020-03-16T18:32:21.265214Z info Envoy proxy is NOT ready: config not received from Pilot (is Pilot running?): cds updates: 16 successful, 0 rejected; lds updates: 0 successful, 0 rejected
2020-03-16T18:32:23.266159Z info Envoy proxy is NOT ready: config not received from Pilot (is Pilot running?): cds updates: 16 successful, 0 rejected; lds updates: 0 successful,
Weirdly, other clusters that I upgraded went through fine. Any idea where this error might be coming from? istioctl analyze reports no issues.
The error goes away after killing (recreating) the nodes, but the istio-proxies still fail with:
info Envoy proxy is NOT ready: config not received from Pilot (is Pilot running?): cds updates: 1 successful, 0 rejected; lds updates: 0 successful, 0 rejected

As far as I know, since version 1.4.4 Istio provides istioctl upgrade, which should be used when you want to upgrade Istio from 1.4.x to 1.5.0.
The istioctl upgrade command performs an upgrade of Istio. Before performing the upgrade, it checks that the Istio installation meets the upgrade eligibility criteria. Also, it alerts the user if it detects any changes in the profile default values between Istio versions.
The upgrade command can also perform a downgrade of Istio.
See the istioctl upgrade reference for all the options provided by the istioctl upgrade command.
istioctl upgrade --help
The upgrade command checks for upgrade version eligibility and, if eligible, upgrades the Istio control plane components in-place. Warning: traffic may be disrupted during upgrade. Please ensure PodDisruptionBudgets are defined to maintain service continuity.
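For reference, the basic flow is only a couple of commands. This is a minimal sketch, assuming the 1.5.0 istioctl binary is on your PATH; if the control plane was installed with a customized profile you would pass it with -f, and the verification step is illustrative:
# check client and control-plane versions before upgrading
istioctl version
# upgrade the in-cluster control plane in place with the 1.5.0 binary
istioctl upgrade
# confirm the new istiod pod comes up
kubectl get pods -n istio-system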
I ran a test on a GCP cluster with Istio 1.4.6 installed via istioctl, then used istioctl upgrade from version 1.5.0, and everything worked fine.
kubectl get pods -n istio-system
NAME READY STATUS RESTARTS AGE
istio-ingressgateway-598796f4d9-lvzdb 1/1 Running 0 12m
istiod-7d9c7bdd6-mggx7 1/1 Running 0 12m
prometheus-b47d8c58c-7spq5 2/2 Running 0 12m
I checked the logs and ran a few simple examples, and no errors like the ones in your output appear in istiod.
Upgrade prerequisites for istioctl upgrade
Ensure you meet these requirements before starting the upgrade process:
Istio version 1.4.4 or higher is installed.
Your Istio installation was installed using istioctl.
I assume that, because of the differences between 1.4.x and 1.5.0, there may be issues when you mix both installation methods, Helm and istioctl. The best option here would be to install Istio 1.4.6 with istioctl and then upgrade it to 1.5.0, roughly as sketched below.
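A rough sketch of that flow; the Helm release names (istio, istio-init) and flags are illustrative and depend on how your cluster was originally set up:
# remove the old Helm-based installation first (Helm 2 syntax; adjust to your setup)
helm delete --purge istio
helm delete --purge istio-init
# install 1.4.6 using the 1.4.6 istioctl binary
istioctl manifest apply
# then upgrade in place using the 1.5.0 istioctl binary
istioctl upgrade
# verify
kubectl get pods -n istio-system
istioctl analyze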
I hope this answers your question. Let me know if you have any more questions.

Related

Google Cloud SDK throws Reachability Check failed after Command Line Tools update on macOS 12.4

After updating Command Line Tools for Xcode to version 13.4, the gcloud compute ssh command stopped working with the error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate.
I'm not behind proxy or firewall.
What I've tried so far: updating the Google Cloud SDK, reinstalling it, and removing and installing the Google Cloud SDK from scratch a number of times, but the gcloud init command fails with the same error. Downgrading Command Line Tools to 13.2 didn't help. Neither did updating certifi and running "Install Certificates.command".
output of "gcloud info --run-diagnostics --verbosity debug":
DEBUG: Running [gcloud.info] with arguments: [--run-diagnostics: "True", --verbosity: "debug"]
Network diagnostic detects and fixes local network connection issues.
Checking network connection...⠏DEBUG: Starting new HTTPS connection (1): accounts.google.com:443
Checking network connection...⠛DEBUG: https://accounts.google.com:443 "GET / HTTP/1.1" 302 338
Checking network connection...⠹DEBUG: https://accounts.google.com:443 "GET /ServiceLogin?passive=1209600&continue=https%3A%2F%2Faccounts.google.com%2F&followup=https%3A%2F%2Faccounts.google.com%2F HTTP/1.1" 302 526
DEBUG: https://accounts.google.com:443 "GET /v3/signin/identifier?dsh=S352504070%3A1656098809680794&continue=https%3A%2F%2Faccounts.google.com%2F&followup=https%3A%2F%2Faccounts.google.com%2F&passive=1209600&flowName=WebLiteSignIn&flowEntry=ServiceLogin&ifkv=AX3vH3-l3sW9otbTScMC6LItjgqZXIpEl6jaKQLX4a-o3Z7M4L5oVPqMq_V_Vltgjce-HlGz4y0mFQ HTTP/1.1" 200 None
Checking network connection...⠼DEBUG: Starting new HTTPS connection (1): cloudresourcemanager.googleapis.com:443
DEBUG: Starting new HTTPS connection (1): www.googleapis.com:443
Checking network connection...⠶DEBUG: Starting new HTTPS connection (1): dl.google.com:443
Checking network connection...⠧DEBUG: https://dl.google.com:443 "GET /dl/cloudsdk/channels/rapid/components-2.json HTTP/1.1" 200 190919
Checking network connection...done.
ERROR: Reachability Check failed.
httplib2 cannot reach https://cloudresourcemanager.googleapis.com/v1beta1/projects:
[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1091)
httplib2 cannot reach https://www.googleapis.com/auth/cloud-platform:
[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1091)
requests cannot reach https://cloudresourcemanager.googleapis.com/v1beta1/projects:
HTTPSConnectionPool(host='cloudresourcemanager.googleapis.com', port=443): Max retries exceeded with url: /v1beta1/projects (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1091)')))
requests cannot reach https://www.googleapis.com/auth/cloud-platform:
HTTPSConnectionPool(host='www.googleapis.com', port=443): Max retries exceeded with url: /auth/cloud-platform (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1091)')))
Network connection problems may be due to proxy or firewall settings.
Do you have a network proxy you would like to set in gcloud (Y/n)? n
ERROR: Network diagnostic failed (0/1 checks passed).
Property diagnostic detects issues that may be caused by properties.
Checking hidden properties...done.
Hidden Property Check passed.
Property diagnostic passed (1/1 checks passed).
DEBUG: (gcloud.info) Some of the checks in diagnostics failed.
Traceback (most recent call last):
File "/Users/gclouder/google-cloud-sdk/lib/googlecloudsdk/calliope/cli.py", line 987, in Execute
resources = calliope_command.Run(cli=self, args=args)
File "/Users/gclouder/google-cloud-sdk/lib/googlecloudsdk/calliope/backend.py", line 809, in Run
resources = command_instance.Run(args)
File "/Users/gclouder/google-cloud-sdk/lib/surface/info.py", line 91, in Run
raise exceptions.Error('Some of the checks in diagnostics failed.')
googlecloudsdk.core.exceptions.Error: Some of the checks in diagnostics failed.
ERROR: (gcloud.info) Some of the checks in diagnostics failed.
output of "gcloud info":
Google Cloud SDK [391.0.0]
Platform: [Mac OS X, x86_64] uname_result(system='Darwin', node='gclouder.local', release='21.5.0', version='Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:22 PDT 2022; root:xnu-8020.121.3~4/RELEASE_X86_64', machine='x86_64', processor='i386')
Locale: (None, 'UTF-8')
Python Version: [3.7.9 (v3.7.9:13c94747c7, Aug 15 2020, 01:31:08) [Clang 6.0 (clang-600.0.57)]]
Python Location: [/Users/gclouder/.config/gcloud/virtenv/bin/python3]
OpenSSL: [OpenSSL 1.1.1g 21 Apr 2020]
Requests Version: [2.22.0]
urllib3 Version: [1.25.9]
Site Packages: [Enabled]
Installation Root: [/Users/gclouder/google-cloud-sdk]
Installed Components:
gsutil: [5.10]
core: [2022.06.17]
bq: [2.0.75]
System PATH: [/Users/gclouder/.config/gcloud/virtenv/bin:/Users/gclouder/google-cloud-sdk/bin:/Users/gclouder/.nvm/versions/node/v14.19.0/bin:/Users/gclouder/.jenv/shims:/Users/gclouder/.jenv/bin:/usr/local/opt/mysql#5.7/bin:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin]
Python PATH: [/Users/gclouder/google-cloud-sdk/lib/third_party:/Users/gclouder/google-cloud-sdk/lib:/Library/Frameworks/Python.framework/Versions/3.7/lib/python37.zip:/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7:/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/lib-dynload:/Users/gclouder/.config/gcloud/virtenv/lib/python3.7/site-packages]
Cloud SDK on PATH: [True]
Kubectl on PATH: [/usr/local/bin/kubectl]
Installation Properties: [/Users/gclouder/google-cloud-sdk/properties]
User Config Directory: [/Users/gclouder/.config/gcloud]
Active Configuration Name: [default]
Active Configuration Path: [/Users/gclouder/.config/gcloud/configurations/config_default]
Account: [None]
Project: [None]
Current Properties:
[core]
disable_usage_reporting: [True] (property file)
Logs Directory: [/Users/gclouder/.config/gcloud/logs]
Last Log File: [/Users/gclouder/.config/gcloud/logs/2022.06.24/21.26.47.993939.log]
git: [git version 2.32.1 (Apple Git-133)]
ssh: [OpenSSH_8.6p1, LibreSSL 3.3.6]
Update: the culprit was the corporate antivirus, which started behaving this way after a software update.
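If a corporate proxy or antivirus intercepts TLS like this, one workaround is to point gcloud at a custom CA bundle rather than reinstalling the SDK. A sketch, assuming you can export the intercepting CA certificate; the file path is illustrative:
# tell gcloud to trust the corporate/antivirus CA bundle
gcloud config set core/custom_ca_certs_file /path/to/corporate-ca-bundle.pem
# re-run diagnostics to confirm the reachability check now passes
gcloud info --run-diagnostics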

AWS managed hyperledger fabric v1.4.7 blockchain - Getting bad certificate error when connecting to the fabric network

I have deployed an AWS managed Hyperledger Fabric v1.4.7 blockchain. The HLF blockchain network and the EC2 instance (hlf-client) are in the same VPC, and everything seems to be working fine since I am able to invoke transactions using the cli container.
I have a client app that uses the fabric-sdk-go gateway API to connect to the fabric network using the connection-profile.yaml to invoke/query the blockchain. This client app runs in a Docker container on the same EC2 instance as the cli container, which has all the necessary security configuration. The client app is unable to connect to the fabric network due to a bad certificate error.
The error log on the client app is:
[fabsdk/util] 2021/11/02 09:55:17 UTC - lazyref.(*Reference).refreshValue -> WARN Error - initializer returned error: QueryBlockConfig failed: QueryBlockConfig failed: queryChaincode failed: Transaction processing for endorser [nd-cjfwwnimujabllevl6yitqqmxi.m-l3ascxxbincwrbtirbgpp4bp7u.n-rh3k6kahfnd6bgtxxgru7c3b5q.managedblockchain.ap-southeast-1.amazonaws.com:30003]: Endorser Client Status Code: (2) CONNECTION_FAILED. Description: dialing connection on target [nd-cjfwwnimujabllevl6yitqqmxi.m-l3ascxxbincwrbtirbgpp4bp7u.n-rh3k6kahfnd6bgtxxgru7c3b5q.managedblockchain.ap-southeast-1.amazonaws.com:30003]: connection is in TRANSIENT_FAILURE. Will retry again later
The corresponding peer log is:
[36m2021-11-02 10:07:17.789 UTC [grpc] handleRawConn -> DEBU 39501a[0m grpc: Server.Serve failed to complete security handshake from "10.0.2.131:39100": remote error: tls: bad certificate
[31m2021-11-02 10:10:17.809 UTC [core.comm] ServerHandshake -> ERRO 395322[0m TLS handshake failed with error remote error: tls: bad certificate server=PeerServer remoteaddress=10.0.2.131:12696
The same certificate files are used when invoking transactions via the cli. Could anyone tell me what's wrong with my setup here, or am I missing some other configuration?
I have generated the ccp (connection-profile.yaml) as below:
---
name: n-RH3K6KAHFND6BGTXXGRU7C3B5Q
version: 1.0.0
client:
  organization: Org1
  connection:
    timeout:
      peer:
        endorser: "300"
channels:
  mychannel:
    peers:
      nd-CJFWWNIMUJABLLEVL6YITQQMXI:
        endorsingPeer: true
        chaincodeQuery: true
        ledgerQuery: true
        eventSource: true
organizations:
  Org1:
    mspid: m-L3ASCXXBINCWRBTIRBGPP4BP7U
    peers:
      - nd-CJFWWNIMUJABLLEVL6YITQQMXI
    certificateAuthorities:
      - m-L3ASCXXBINCWRBTIRBGPP4BP7U
peers:
  nd-CJFWWNIMUJABLLEVL6YITQQMXI:
    url: grpcs://nd-cjfwwnimujabllevl6yitqqmxi.m-l3ascxxbincwrbtirbgpp4bp7u.n-rh3k6kahfnd6bgtxxgru7c3b5q.managedblockchain.managedblockchain.us-east-1.amazonaws.com:30003
    eventUrl: grpcs://nd-cjfwwnimujabllevl6yitqqmxi.m-l3ascxxbincwrbtirbgpp4bp7u.n-rh3k6kahfnd6bgtxxgru7c3b5q.managedblockchain.managedblockchain.us-east-1.amazonaws.com:30004
    grpcOptions:
      ssl-target-name-override: nd-CJFWWNIMUJABLLEVL6YITQQMXI
    tlsCACerts:
      path: /home/ec2-user/managedblockchain-tls-chain.pem
certificateAuthorities:
  m-L3ASCXXBINCWRBTIRBGPP4BP7U:
    url: https://ca.m-l3ascxxbincwrbtirbgpp4bp7u.n-rh3k6kahfnd6bgtxxgru7c3b5q.managedblockchain.managedblockchain.us-east-1.amazonaws.com:30002
    httpOptions:
      verify: false
    tlsCACerts:
      path: /home/ec2-user/managedblockchain-tls-chain.pem
    caName: m-L3ASCXXBINCWRBTIRBGPP4BP7U
The following solution applies to:
HLF v1.4.7 AWS Managed Blockchain
Fabric client [fabric-sdk-go v1.0.0] Gateway programming model
To resolve the issue, remove the grpcOptions stanza from the peer entry in the connection profile.
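For reference, the peer entry would then look like this (same values as in the profile above, just without grpcOptions):
peers:
  nd-CJFWWNIMUJABLLEVL6YITQQMXI:
    url: grpcs://nd-cjfwwnimujabllevl6yitqqmxi.m-l3ascxxbincwrbtirbgpp4bp7u.n-rh3k6kahfnd6bgtxxgru7c3b5q.managedblockchain.managedblockchain.us-east-1.amazonaws.com:30003
    eventUrl: grpcs://nd-cjfwwnimujabllevl6yitqqmxi.m-l3ascxxbincwrbtirbgpp4bp7u.n-rh3k6kahfnd6bgtxxgru7c3b5q.managedblockchain.managedblockchain.us-east-1.amazonaws.com:30004
    tlsCACerts:
      path: /home/ec2-user/managedblockchain-tls-chain.pem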

Istio 1.9 integration with virtual machine (AWS EC2) - generated hosts file is empty

I have installed MySQL in a VM and want my EKS cluster with Istio 1.9 installed to talk to it. I am following https://istio.io/latest/docs/setup/install/virtual-machine/, but when I do this step the hosts file that gets generated is empty.
I tried with this empty hosts file anyway, but when starting the VM sidecar with this command I get errors:
> sudo systemctl start istio
Tailing /var/log/istio/istio.log shows:
2021-03-22T18:44:02.332421Z info Proxy role ips=[10.8.1.179 fe80::dc:36ff:fed3:9eea] type=sidecar id=ip-10-8-1-179.vm domain=vm.svc.cluster.local
2021-03-22T18:44:02.332429Z info JWT policy is third-party-jwt
2021-03-22T18:44:02.332438Z info Pilot SAN: [istiod.istio-system.svc]
2021-03-22T18:44:02.332443Z info CA Endpoint istiod.istio-system.svc:15012, provider Citadel
2021-03-22T18:44:02.332997Z info Using CA istiod.istio-system.svc:15012 cert with certs: /etc/certs/root-cert.pem
2021-03-22T18:44:02.333093Z info citadelclient Citadel client using custom root cert: istiod.istio-system.svc:15012
2021-03-22T18:44:02.410934Z info ads All caches have been synced up in 82.7974ms, marking server ready
2021-03-22T18:44:02.411247Z info sds SDS server for workload certificates started, listening on "./etc/istio/proxy/SDS"
2021-03-22T18:44:02.424855Z info sds Start SDS grpc server
2021-03-22T18:44:02.425044Z info xdsproxy Initializing with upstream address "istiod.istio-system.svc:15012" and cluster "Kubernetes"
2021-03-22T18:44:02.425341Z info Starting proxy agent
2021-03-22T18:44:02.425483Z info dns Starting local udp DNS server at localhost:15053
2021-03-22T18:44:02.427627Z info dns Starting local tcp DNS server at localhost:15053
2021-03-22T18:44:02.427683Z info Opening status port 15020
2021-03-22T18:44:02.432407Z info Received new config, creating new Envoy epoch 0
2021-03-22T18:44:02.433999Z info Epoch 0 starting
2021-03-22T18:44:02.690764Z warn ca ca request failed, starting attempt 1 in 91.93939ms
2021-03-22T18:44:02.693579Z info Envoy command: [-c etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --parent-shutdown-time-s 60 --service-cluster istio-proxy --service-node sidecar~10.8.1.179~ip-10-8-1-179.vm~vm.svc.cluster.local --local-address-ip-version v4 --bootstrap-version 3 --log-format %Y-%m-%dT%T.%fZ %l envoy %n %v -l warning --component-log-level misc:error --concurrency 2]
2021-03-22T18:44:02.782817Z warn ca ca request failed, starting attempt 2 in 195.226287ms
2021-03-22T18:44:02.978344Z warn ca ca request failed, starting attempt 3 in 414.326774ms
2021-03-22T18:44:03.392946Z warn ca ca request failed, starting attempt 4 in 857.998629ms
2021-03-22T18:44:04.251227Z warn sds failed to warm certificate: failed to generate workload certificate: create certificate: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup istiod.istio-system.svc on 10.8.0.2:53: no such host"
2021-03-22T18:44:04.849207Z warn ca ca request failed, starting attempt 1 in 91.182413ms
2021-03-22T18:44:04.940652Z warn ca ca request failed, starting attempt 2 in 207.680983ms
2021-03-22T18:44:05.148598Z warn ca ca request failed, starting attempt 3 in 384.121814ms
2021-03-22T18:44:05.533019Z warn ca ca request failed, starting attempt 4 in 787.704352ms
2021-03-22T18:44:06.321042Z warn sds failed to warm certificate: failed to generate workload certificate: create certificate: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup istiod.istio-system.svc on 10.8.0.2:53: no such host"

CockroachDB on AWS EKS cluster - [n?] no stores bootstrapped

I am attempting to deploy CockroachDB v2.1.6 to a new AWS EKS cluster. Everything deploys successfully; the StatefulSet, Services, PVs & PVCs are created. The AWS EBS volumes are created successfully too.
The issue is the pods never get to a READY state.
pod/cockroachdb-0 0/1 Running 0 14m
pod/cockroachdb-1 0/1 Running 0 14m
pod/cockroachdb-2 0/1 Running 0 14m
If I 'describe' the pods I get the following:
Normal Pulled 46s kubelet, ip-10-5-109-70.eu-central-1.compute.internal Container image "cockroachdb/cockroach:v2.1.6" already present on machine
Normal Created 46s kubelet, ip-10-5-109-70.eu-central-1.compute.internal Created container cockroachdb
Normal Started 46s kubelet, ip-10-5-109-70.eu-central-1.compute.internal Started container cockroachdb
Warning Unhealthy 1s (x8 over 36s) kubelet, ip-10-5-109-70.eu-central-1.compute.internal Readiness probe failed: HTTP probe failed with statuscode: 503
If I examine the logs of a pod I see this:
I200409 11:45:18.073666 14 server/server.go:1403 [n?] no stores bootstrapped and --join flag specified, awaiting init command.
W200409 11:45:18.076826 87 vendor/google.golang.org/grpc/clientconn.go:1293 grpc: addrConn.createTransport failed to connect to {cockroachdb-0.cockroachdb:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup cockroachdb-0.cockroachdb on 172.20.0.10:53: no such host". Reconnecting...
W200409 11:45:18.076942 21 gossip/client.go:123 [n?] failed to start gossip client to cockroachdb-0.cockroachdb:26257: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: lookup cockroachdb-0.cockroachdb on 172.20.0.10:53: no such host"
I came across this comment from the CockroachDB forum (https://forum.cockroachlabs.com/t/http-probe-failed-with-statuscode-503/2043/6)
Both the cockroach_out.log and cockroach_output1.log files you sent me (corresponding to mycockroach-cockroachdb-0 and mycockroach-cockroachdb-2) print out no stores bootstrapped during startup and prefix all their log lines with n?, indicating that they haven’t been allocated a node ID. I’d say that they may have never been properly initialized as part of the cluster.
I have deleted everything, including the PVs, PVCs & AWS EBS volumes, with kubectl delete and reapplied, but I hit the same issue.
Any thoughts would be very much appreciated. Thank you
I was not aware that you had to initialize the CockroachDB cluster after creating it. I did the following to resolve my issue:
kubectl exec -it cockroachdb-0 -n <namespace> -- /bin/sh
/cockroach/cockroach init
See here for more details - https://www.cockroachlabs.com/docs/v19.2/cockroach-init.html
After this the pods started running correctly.
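A slightly expanded version of the same steps with a quick verification; this is a sketch, and the namespace, label selector, and --insecure flag are assumptions based on the standard insecure StatefulSet configuration:
# open a shell in one of the pods (namespace is illustrative)
kubectl exec -it cockroachdb-0 -n default -- /bin/sh
# inside the pod, initialize the cluster exactly once
/cockroach/cockroach init --insecure
# back outside the pod, confirm the pods become READY 1/1
kubectl get pods -l app=cockroachdb
# and that all nodes have joined the cluster
kubectl exec -it cockroachdb-0 -- /cockroach/cockroach node status --insecure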

Spark 0.9.0 standalone - connection refused

I am using Spark 0.9.0 in standalone mode.
When I try a streaming application in standalone mode, I get a connection refused exception.
I added the hostname to /etc/hosts and also tried with the IP alone. In both cases the worker registered with the master without any issues.
Is there a way to solve this issue?
14/02/28 07:15:01 INFO Master: akka.tcp://driverClient#127.0.0.1:55891 got disassociated, removing it.
14/02/28 07:15:04 INFO Master: Registering app Twitter Streaming
14/02/28 07:15:04 INFO Master: Registered app Twitter Streaming with ID app-20140228071504-0000
14/02/28 07:34:42 INFO Master: akka.tcp://spark#127.0.0.1:33688 got disassociated, removing it.
14/02/28 07:34:42 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.165.35.96%3A38903-6#-1146558090] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/02/28 07:34:42 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster#10.165.35.96:8910] -> [akka.tcp://spark#127.0.0.1:33688]: Error [Association failed with [akka.tcp://spark#127.0.0.1:33688]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark#127.0.0.1:33688]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: /127.0.0.1:33688
I had a similar issue when running Spark in cluster mode. My problem was that the server was started with the hostname 'fluentd:7077' and not the FQDN. I edited
/sbin/start-master.sh
to reflect how my remote nodes connect, using the --ip flag.
/usr/lib/jvm/jdk1.7.0_51/bin/java -cp :/home/vagrant/spark-0.9.0-incubating-bin-hadoop2/conf:/home/vagrant/spark-0.9.0-incubating-bin-hadoop2/assembly/target/scala-2.10/spark-assembly_2.10-0.9.0-incubating-hadoop2.2.0.jar -Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.master.Master --ip fluentd.alex.dev --port 7077 --webui-port 8080
Hope this helps.
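An alternative to editing start-master.sh directly (a sketch, not from the original answer; SPARK_MASTER_IP and SPARK_MASTER_PORT are the standalone-mode settings in Spark 0.9) is to set the bind address in conf/spark-env.sh so the startup scripts pick it up:
# conf/spark-env.sh
export SPARK_MASTER_IP=fluentd.alex.dev
export SPARK_MASTER_PORT=7077
# then restart the master
./sbin/stop-master.sh
./sbin/start-master.sh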