I am setting up Spark 0.9 on AWS and am finding that when launching the interactive Pyspark shell, my executors / remote workers are first being registered:
14/07/08 22:48:05 INFO cluster.SparkDeploySchedulerBackend: Registered executor:
Actor[akka.tcp://sparkExecutor@ip-xx-xx-xxx-xxx.ec2.internal:54110/user/
Executor#-862786598] with ID 0
and then disassociated almost immediately, before I have the chance to run anything:
14/07/08 22:48:05 INFO cluster.SparkDeploySchedulerBackend: Executor 0 disconnected,
so removing it
14/07/08 22:48:05 ERROR scheduler.TaskSchedulerImpl: Lost an executor 0 (already
removed): remote Akka client disassociated
Any idea what might be wrong? I've tried adjusting the JVM options spark.akka.frameSize and spark.akka.timeout, but I'm pretty sure this is not the issue since (1) I'm not running anything to begin with, and (2) my executors are disconnecting a few seconds after startup, which is well within the default 100s timeout.
Thanks!
Jack
I had a very similar problem, if not the same one.
It started working for me once the workers connected to the master using the exact same name that the master thought it had.
My log messages were something like:
ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkWorker@idc1-hrm1.heylinux.com:7078] -> [akka.tcp://sparkMaster@vagrant-centos64.vagrantup.com:7077]: Error [Association failed with [akka.tcp://sparkMaster@vagrant-centos64.vagrantup.com:7077]].
ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkWorker@192.168.121.127:7078] -> [akka.tcp://sparkMaster@idc1-hrm1.heylinux.com:7077]: Error [Association failed with [akka.tcp://sparkMaster@idc1-hrm1.heylinux.com:7077]]
WARN util.Utils: Your hostname, idc1-hrm1 resolves to a loopback address: 127.0.0.1; using 192.168.121.187 instead (on interface eth0)
So check the log of the master and see what name it thinks it has.
Then use that very same name on the workers.
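As an illustrative sketch only (the hostname below is a placeholder, and SPARK_MASTER_IP in conf/spark-env.sh is the standard setting for standalone mode), you can pin the master to one explicit name and register every worker against exactly that name:
# conf/spark-env.sh on the master: advertise one explicit, resolvable name
export SPARK_MASTER_IP=master.example.internal
# on each worker, connect using exactly that same name
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://master.example.internal:7077
Restart the master and the workers after the change so both sides agree on the address.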
I am new to Kafka and I set up an instance in AWS; it runs well.
Then I created another AWS instance and ran this code:
See image here
It can print out messages that I published to Kafka.
If I run the same code on the Kafka server itself, I can also get messages.
However, if I run the same code on my own laptop, I can't get anything.
I thought it might be my code, so I used Kafka's own console client on my laptop:
bin/kafka-console-consumer.sh --topic test22 --bootstrap-server 34.215.180.111:9092
Now I got an error:
[2021-05-11 16:21:32,252] WARN [Consumer clientId=consumer-console-consumer-94326-1, groupId=console-consumer-94326] Error connecting to node ip-172-31-29-222.us-west-2.compute.internal:9092 (id: 0 rack: null) (org.apache.kafka.clients.NetworkClient)
ip-172-31-29-222.us-west-2.compute.internal
This hostname is actually the AWS instance's internal address:
See image here
Then I thought it might be an Amazon issue, so I repeated the whole process on Google Cloud and got the same result:
[2021-05-11 17:15:34,840] WARN [Consumer clientId=consumer-console-consumer-2377-1, groupId=console-consumer-2377] Error connecting to node instance-1.us-central1-a.c.seventh-seeker-267203.internal:9092 (id: 0 rack: null) (org.apache.kafka.clients.NetworkClient)
These internal addresses cannot be accessed from external computers at all.
Can anybody help? Thanks!
The logs are showing you the advertised.listeners of the brokers. If you want those to be different so that you can connect, you'll need to modify that property so the brokers advertise addresses that are resolvable by the clients.
https://www.confluent.io/blog/kafka-listeners-explained/
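For example, a minimal sketch of the relevant lines in the broker's config/server.properties, where the public DNS name below is a placeholder for your broker's externally resolvable address; restart the broker after editing:
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://ec2-public-hostname.example.com:9092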
When I run the Workflow Manager configuration wizard, I get an error at the "add host to Service Bus farm" step.
We have SharePoint as a standalone installation; the OS is Windows Server 2012 R2
and the database is SQL Server 2016 Developer.
I followed the two URLs below for the installation:
https://collab365.community/configuring-sharepoint-2013-to-support-workflow-management/
https://www.c-sharpcorner.com/article/workflow-manager-configuration-for-sharepoint-server-2013/
I am unable to understand where exactly the issue is.
Please find the log below:
[Verbose] [12/10/2018 4:43:54 PM]: Service Bus services starting.
[Progress] [12/10/2018 4:43:54 PM]: Service Bus services starting.
[Error] [12/10/2018 4:53:55 PM]: System.Management.Automation.CmdletInvocationException: Starting service Service Bus Message Broker failed: Time out has expired and the operation has not been completed. ---> Microsoft.ServiceBus.Commands.Common.Exceptions.OperationFailedException: Starting service Service Bus Message Broker failed: Time out has expired and the operation has not been completed.
at Microsoft.ServiceBus.Commands.Common.SCMHelper.StartService(String serviceName, Nullable`1 waitTimeout, String hostName)
at Microsoft.ServiceBus.Commands.ServiceBusConfigHelper.StartSBServices(String hostName, Nullable`1 waitTimeout)
at Microsoft.ServiceBus.Commands.AddSBHost.ProcessRecordImplementation()
--- End of inner exception stack trace ---
at System.Management.Automation.Runspaces.AsyncResult.EndInvoke()
at System.Management.Automation.PowerShell.EndInvoke(IAsyncResult asyncResult)
at Microsoft.Workflow.Deployment.ConfigWizard.CommandletHelper.InvokePowershell(Command command, Action`3 updateProgress)
at Microsoft.Workflow.Deployment.ConfigWizard.ProgressPageViewModel.AddSBNode(FarmCreationModel model, Boolean isFirstCommand)
Please let me know how to resolve this issue so I can install Workflow Manager.
What worked for me was enabling TLS 1.0 in the registry.
In my case the Client key did not exist in my registry, so I only enabled the Server one:
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL\Protocols\TLS 1.0\Client]
"Enabled"=dword:00000001
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL\Protocols\TLS 1.0\Server]
"Enabled"=dword:00000001
FYI: I stopped the Service Bus Message Broker service while the Workflow Manager configuration wizard was running the "add host to Service Bus farm" task, and after that the wizard completed successfully. I hope you can resolve this issue :)
This is the link where I found the answer: http://answersweb.azurewebsites.net/MVC/Post/Thread/e6667e72-36db-44d7-bcb9-0d537cd19542?category=workflow (it is CRBenson's post), thank you very much.
I had almost the same issue; installing the correct patch fixed it.
Complete details are in the thread below:
http://fixingsharepoint.blogspot.com/2021/02/service-bus-gateway-service-stuck-at.html
I am getting an error like
Did not respond to the ping, removing transaction processor.
Can anyone explain what this error means, or whether there is a problem with my setup?
This is a message from the Hyperledger Sawtooth blockchain's Validator.
A timeout occurred when the Validator was checking connections with all the registered transaction processors. If a transaction processor does not respond, it is removed from the list.
Some possible causes: the transaction processor (TP) died, so check that the TP process is still running (inside the Docker container, if you are running Docker); a network problem, so check connectivity if the TP is on another host or another virtual machine; or the TP is "frozen", hanging, or has a bug, so check the message logs and add logging statements (using LOGGER) if needed.
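As an illustrative sketch only (the container name, host name, and port are assumptions based on a typical Docker-based Sawtooth setup), a few quick checks from a shell:
# is the TP container still running? (container name is a placeholder)
docker ps | grep settings-tp
# look at the TP's recent log output
docker logs --tail 100 settings-tp
# if the TP runs on another host, can it reach the validator's component endpoint (default port 4004)?
nc -vz validator-host 4004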
I am installing wso2am-analytics-2.2.0 with port offset 0, and I get error messages such as:
WARN {org.apache.spark.scheduler.TaskSetManager} - Lost task 0.0 in stage 2990.0 (TID 147439, 10.0.11.26): FetchFailed(BlockManagerId(0, someserver.compute.internal, 12001), shuffleId=745, mapId=0, reduceId=0, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to ip-10-0-17-131.eu-central-1.compute.internal:12001
Apparently something is configured somewhere to connect to port 12001 (while it seems the server listens on 12000).
Where could I configure the port 12000?
Thanks
This port is defined in <Product_Home>/repository/conf/analytics/spark/spark-defaults.conf. The property name is spark.blockManager.port. However, you shouldn't configure it manually.
To my knowledge, this particular issue is a connectivity problem. DAS uses ports in the 1200x range for Spark executor communication, so with multiple executors, or when a new executor is spawned after one gets killed, the next incremented port will be opened. Hence, at the network level we should also allow traffic through that port range. Opening that port range towards your network interface ip-10-0-17-131.eu-central-1.compute.internal should solve your issue.
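As an illustrative sketch only (the security group ID, port range, and CIDR are assumptions rather than values from this setup), opening such a port range with the AWS CLI could look like:
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 12000-12010 \
  --cidr 10.0.0.0/16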
I am trying to execute pyspark from my Mac to do computation on an EC2 Spark cluster.
If I login to the cluster, it works as expected:
$ ec2/spark-ec2 -i ~/.ec2/spark.pem -k spark login test-cluster2
$ spark/bin/pyspark
Then do a simple task
>>> data=sc.parallelize(range(1000),10)
>>> data.count()
Works as expected:
14/06/26 16:38:52 INFO spark.SparkContext: Starting job: count at <stdin>:1
14/06/26 16:38:52 INFO scheduler.DAGScheduler: Got job 0 (count at <stdin>:1) with 10 output partitions (allowLocal=false)
14/06/26 16:38:52 INFO scheduler.DAGScheduler: Final stage: Stage 0 (count at <stdin>:1)
...
14/06/26 16:38:53 INFO spark.SparkContext: Job finished: count at <stdin>:1, took 1.195232619 s
1000
But now, if I try the same thing from my local machine,
$ MASTER=spark://ec2-54-234-204-13.compute-1.amazonaws.com:7077 bin/pyspark
it can't seem to connect to the cluster
14/06/26 09:45:43 INFO AppClient$ClientActor: Connecting to master spark://ec2-54-234-204-13.compute-1.amazonaws.com:7077...
14/06/26 09:45:47 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
...
File "/Users/anthony1/git/incubator-spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o20.collect.
: org.apache.spark.SparkException: Job aborted: Spark cluster looks down
14/06/26 09:53:17 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
I thought the problem was with the EC2 security groups, but adding inbound rules to both the master and slave security groups to accept all ports did not help.
Any help will be greatly appreciated!
Others are asking the same question on the mailing list:
http://apache-spark-user-list.1001560.n3.nabble.com/Deploying-a-python-code-on-a-spark-EC2-cluster-td4758.html#a8465
The spark-ec2 script configures the Spark cluster on EC2 as standalone, which means it cannot work with remote submits. I struggled with the same error you describe for days before figuring out that it is not supported. The error message is unfortunately misleading.
So you have to copy your code to the master and log in there to execute your Spark task.
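A minimal sketch of that workflow, reusing the key and master hostname from the question (the script name my_job.py is a placeholder, and the root login is an assumption based on the default spark-ec2 setup):
# copy the job script to the master
scp -i ~/.ec2/spark.pem my_job.py root@ec2-54-234-204-13.compute-1.amazonaws.com:~/
# log into the master and run it from there
ssh -i ~/.ec2/spark.pem root@ec2-54-234-204-13.compute-1.amazonaws.com
spark/bin/spark-submit my_job.py   # on older releases without spark-submit, use spark/bin/pyspark my_job.py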
In my experience, "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory" usually means you have accidentally set the cores too high, or set the executor memory too high, i.e. higher than what your nodes actually have.
Other, less likely causes could be that you got the URI wrong and you are not really connecting to the master. I also once saw that problem when the /run partition was 100% full.
Even less likely, your cluster may actually be down, and you need to restart your Spark workers.
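If over-allocation is the suspect, an illustrative sketch (the values below are assumptions, not recommendations) is to cap what the application requests in conf/spark-defaults.conf so it fits within what the workers report in the standalone cluster UI on port 8080:
spark.executor.memory   512m
spark.cores.max         2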