I am getting an error while creating a table in Zeppelin: "Could not load table TEST_TABLE_1 : There are no Velocity servers up."
All services in Ambari show green status. I tried restarting the servers, the instances, and all services, but the error persists.
Please comment.
Please verify if the Vora servers are running on your Spark worker nodes:
ps -efa | grep v2server
Could you provide the command you are executing and a bit more detail about the error message (e.g. the 1-2 lines before and after it)?
Where are you running your cluster (AWS, datacenter, laptop,...)?
I'm encountering 502 errors in the Airflow (2.0.2) UI hosted in Cloud Composer (1.17.0).
Error: Server Error The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.
The errors last for a few minutes and occur several times a day. Once they are gone, everything works fine.
At the moment of errors:
there is a gap in the logs, after which the logs resume with messages about starting gunicorn:
[1133] [INFO] Starting gunicorn 19.10.0
there is a spike in resource usage of the web server
I didn't spot any other suspicious activity in other parts of the system (workers, scheduler, DB).
I think this is the result of an OOM error, because we have DAGs with a large number of tasks (2k).
But I'd like to be sure, and I haven't found a way to connect to the App Engine VM in the tenant project (where the Airflow web server is hosted by default) to get additional logs.
Does anyone know a way to get additional logs from the Airflow server VMs, or have any other idea?
The Cloud Composer documentation has a Troubleshooting DAGs section. It shows how to check individual worker logs, and it even mentions OOM issues (direct link).
The troubleshooting section is generally well documented, so you should be able to find plenty of useful information there. You can also use Cloud Monitoring and Cloud Logging to monitor Composer, but I am not sure whether that will be valuable in this use case (reference).
I am new to Redis Enterprise and can't fix this problem:
I have a Redis Enterprise cluster (v6.0) in AWS with two nodes. With only one node I can log in to the UI, but after adding the second node the UI always throws me back to the login page after I enter my credentials. Meanwhile, the cluster itself works fine (according to rladmin).
In what direction should I investigate the issue?
P.S.: Can this error from logs cause an issue?
ERROR redis_mgr MainThread: Connect failed: connect: connection failed: Error 2 connecting to unix socket: /var/opt/redislabs/run/ccs.sock. No such file or directory.: retrying
Possibly this solution will help somebody:
the reason was that the ALB in front of the UI did not use sticky sessions.
the solution was to enable sticky sessions, and now it works.
I had a working configuration with fluent-bit on EKS pointing at the AWS Elasticsearch service. To save costs, we deleted that Elasticsearch domain and created a single instance running a standalone Elasticsearch, which is enough for dev purposes (the AWS service doesn't cope well with only one instance).
The issue is that fluent-bit seems to have broken during this migration: I get lots of "[warn] failed to flush chunk" and some "[error] [upstream] connection #55 to ES-SERVER:9200 timed out after 10 seconds".
My current configuration:
[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_URL            https://kubernetes.default.svc:443
    Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
    Kube_Tag_Prefix     kube.var.log.containers.
    Merge_Log           On
    Merge_Log_Key       log_processed
    K8S-Logging.Parser  On
    K8S-Logging.Exclude Off
[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    Parser            docker
    DB                /var/log/flb_kube.db
    Mem_Buf_Limit     50MB
    Skip_Long_Lines   On
    Refresh_Interval  10
    Ignore_Older      1m
I think the issue is in one of those configuration sections: if I comment out the kubernetes filter the errors go away, but I lose the fields in the indices...
I tried tweaking some parameters in fluent-bit to no avail. Does anyone have a suggestion?
The previous logs did not indicate anything, but I finally found something after activating Trace_Error in the elasticsearch output:
{"index":{"_index":"fluent-bit-2021.04.16","_type":"_doc","_id":"Xkxy23gBidvuDr8mzw8W","status":400,"error":{"type":"mapper_parsing_exception","reason":"object mapping for [kubernetes.labels.app] tried to parse field [app] as object, but found a concrete value"}}
Has anyone seen that error before and knows how to solve it?
After looking into the logs and finding the mapping issue, I seem to have resolved the problem. The logs are now correctly parsed and sent to Elasticsearch.
To resolve it I had to raise the output retry limit (Retry_Limit) and add the Replace_Dots option.
[OUTPUT]
    Name          es
    Match         *
    Host          ELASTICSERVER
    Port          9200
    Index         <fluent-bit-{now/d}>
    Retry_Limit   20
    Replace_Dots  On
It seems that at the beginning I had issues with the content being sent. Because of that, the error appeared to continue after the change until a new index was created, which made me think that it was still not resolved.
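A mapper_parsing_exception like the one above usually shows up when pod labels mix a plain key such as app with dotted keys such as app.kubernetes.io/name: Elasticsearch expands dots in field names into nested objects, so app would have to be both a string and an object. The Python sketch below only mimics the effect of replacing dots in key names with underscores; it is not Fluent Bit's code, and the helper replace_dots and the sample labels are hypothetical.

```python
def replace_dots(record):
    """Return a copy of record with '.' replaced by '_' in every key, recursively."""
    if isinstance(record, dict):
        return {key.replace(".", "_"): replace_dots(value)
                for key, value in record.items()}
    if isinstance(record, list):
        return [replace_dots(item) for item in record]
    return record


# Hypothetical labels: "app" is a concrete value, while the dotted key would
# be expanded by Elasticsearch into an object that is also named "app".
labels = {"app": "web", "app.kubernetes.io/name": "web"}
flattened = replace_dots({"kubernetes": {"labels": labels}})
print(flattened)
```

After the replacement, both labels are plain sibling fields, so they no longer compete for the same field name in the index mapping.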
I have a Hortonworks HDP 2.5.3 cluster, and MapReduce jobs in YARN are failing with this error:
java.io.IOException: DistCp failure: Job job_1498784032636_0015 has
failed:
Application application_1498784032636_0015 failed 2 times due to AM Container for appattempt_1498784032636_0015_000002 exited with
exitCode: -1000 For more detailed output, check the application
tracking page:
http://asterdart0005.labs.teradata.com:8088/cluster/app/application_1498784032636_0015 Then click on links to logs of each attempt. Diagnostics: Application
application_1498784032636_0015 initialization failed (exitCode=255)
with output: main : command provided 0 main : run as user is hdfs main
: requested yarn user is hdfs Requested user hdfs is banned
I googled it later, and it seems that hdfs is a banned user according to the configuration in /etc/hadoop/conf/container-executor.cfg on each node. Here is the content of the file:
yarn.nodemanager.local-dirs=/hadoop/yarn/local
yarn.nodemanager.log-dirs=/hadoop/yarn/log
yarn.nodemanager.linux-container-executor.group=hadoop
banned.users=hdfs,yarn,mapred,bin
min.user.id=500
I modified the file on all nodes (namenode, edge and data nodes) as follows:
yarn.nodemanager.local-dirs=/hadoop/yarn/local
yarn.nodemanager.log-dirs=/hadoop/yarn/log
yarn.nodemanager.linux-container-executor.group=hadoop
#banned.users=hdfs,yarn,mapred,bin
min.user.id=500
I then restarted all services in HDFS, YARN and MapReduce2 through Ambari. After the restart my jobs still fail with the same error, and when I checked /etc/hadoop/conf/container-executor.cfg, it had been reset to its initial state:
yarn.nodemanager.local-dirs=/hadoop/yarn/local
yarn.nodemanager.log-dirs=/hadoop/yarn/log
yarn.nodemanager.linux-container-executor.group=hadoop
banned.users=hdfs,yarn,mapred,bin
min.user.id=500
Any idea what the solution is here, i.e. how to remove users from the banned users list?
First, note that you cannot simply comment out the banned.users line; instead, set the value of banned.users to the users you actually want banned. For example, if you do not want to ban the hdfs user, change banned.users=hdfs,yarn,mapred,bin to banned.users=yarn,mapred,bin. If you comment out the banned.users line, hdfs, yarn and mapred are banned by default anyway.
Second, you can follow the steps below to propagate the change to all nodes:
Go to the Ambari server node.
Modify /var/lib/ambari-server/resources/common-services/YARN/<version>/package/templates/container-executor.cfg.j2 to configure the banned users.
Restart the Ambari server and all Ambari agents.
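To illustrate the banned.users rule described above, here is a minimal Python sketch that reports which users end up banned for a given container-executor.cfg text. The function name and the DEFAULT_BANNED list are assumptions for illustration: the default is taken from the answer above, not from the Hadoop source, so check it against your Hadoop version.

```python
# Assumption: default list as stated in the answer above; verify for your version.
DEFAULT_BANNED = ["hdfs", "yarn", "mapred"]


def effective_banned_users(cfg_text, default=DEFAULT_BANNED):
    """Return the banned users that a container-executor.cfg text results in."""
    for raw_line in cfg_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank and commented-out lines are ignored
        key, sep, value = line.partition("=")
        if sep and key.strip() == "banned.users":
            return [user.strip() for user in value.split(",") if user.strip()]
    # No active banned.users line: the built-in default applies.
    return list(default)


commented = "#banned.users=hdfs,yarn,mapred,bin\nmin.user.id=500\n"
edited = "banned.users=yarn,mapred,bin\nmin.user.id=500\n"
print(effective_banned_users(commented))  # hdfs is still banned
print(effective_banned_users(edited))     # hdfs is no longer banned
```

This is why commenting out the line in the question had no effect on the hdfs user, while editing the value does.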
I am trying to run pyspark from my Mac to do computation on an EC2 Spark cluster.
If I log in to the cluster, it works as expected:
$ ec2/spark-ec2 -i ~/.ec2/spark.pem -k spark login test-cluster2
$ spark/bin/pyspark
Then I run a simple task:
>>> data=sc.parallelize(range(1000),10)
>>> data.count()
Works as expected:
14/06/26 16:38:52 INFO spark.SparkContext: Starting job: count at <stdin>:1
14/06/26 16:38:52 INFO scheduler.DAGScheduler: Got job 0 (count at <stdin>:1) with 10 output partitions (allowLocal=false)
14/06/26 16:38:52 INFO scheduler.DAGScheduler: Final stage: Stage 0 (count at <stdin>:1)
...
14/06/26 16:38:53 INFO spark.SparkContext: Job finished: count at <stdin>:1, took 1.195232619 s
1000
But if I try the same thing from my local machine,
$ MASTER=spark://ec2-54-234-204-13.compute-1.amazonaws.com:7077 bin/pyspark
it can't seem to connect to the cluster:
14/06/26 09:45:43 INFO AppClient$ClientActor: Connecting to master spark://ec2-54-234-204-13.compute-1.amazonaws.com:7077...
14/06/26 09:45:47 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
...
File "/Users/anthony1/git/incubator-spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o20.collect.
: org.apache.spark.SparkException: Job aborted: Spark cluster looks down
14/06/26 09:53:17 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
I thought the problem was in the EC2 security groups, but it did not help even after I added inbound rules to both the master and slave security groups to accept all ports.
Any help will be greatly appreciated!
Others are asking the same question on the mailing list:
http://apache-spark-user-list.1001560.n3.nabble.com/Deploying-a-python-code-on-a-spark-EC2-cluster-td4758.html#a8465
The spark-ec2 script configures the Spark cluster in EC2 as standalone, which means it cannot work with remote submits. I struggled with the same error you describe for days before figuring out that this is not supported. The error message is unfortunately misleading.
So you have to copy your code to the master and log in there to run your Spark task.
In my experience, "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory" usually means you have accidentally set the cores too high, or set the executor memory too high, i.e. higher than what your nodes actually have.
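As a rough sanity check for that case, you can compare what you request with what the workers advertise in the cluster UI. The Python sketch below is only an illustration: the function name, worker names and numbers are hypothetical, and it ignores how Spark actually schedules executors.

```python
def workers_that_fit(executor_cores, executor_mem_mb, workers):
    """Return the names of workers with enough cores and memory for one executor.

    workers maps a worker name to a (cores, memory_mb) tuple, e.g. as read
    manually from the cluster UI.
    """
    return [name for name, (cores, mem_mb) in workers.items()
            if cores >= executor_cores and mem_mb >= executor_mem_mb]


# Hypothetical two-worker cluster with 2 cores and 6 GB each.
workers = {"worker-1": (2, 6144), "worker-2": (2, 6144)}
print(workers_that_fit(2, 4096, workers))  # both workers qualify
print(workers_that_fit(4, 8192, workers))  # empty list: the request can never be met
```

An empty result corresponds to the situation where the job waits forever and keeps logging the warning.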
Other, less likely causes: you got the URI wrong and you're not really connecting to the master. I also once saw this problem when the /run partition was 100% full.
Even less likely, your cluster may actually be down, and you need to restart your spark workers.
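To rule out a wrong URI or a blocked port, a plain TCP check against the master can help. The sketch below uses only the Python standard library; the helper name can_connect is hypothetical. Note that a successful connection to port 7077 is necessary but not sufficient, because the executors also have to connect back to the driver on your machine.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


# Example call, using the master host and port from the question:
# can_connect("ec2-54-234-204-13.compute-1.amazonaws.com", 7077)
```

If this returns False from your local machine but True from a node inside the cluster, the problem lies in the network path (security groups, firewall, or the hostname), not in Spark itself.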