Provisioning Issues in the DataFusion - google-cloud-platform

When Data Fusion runs a data pipeline, the pipeline stays in the provisioning state for a while and then stops.
As a result, the Dataproc cluster is never successfully created.
Dataproc's settings are as follows:
- Master
  - Number of masters: 1
  - Master Cores: 2
  - Master Memory (GB): 4
  - Master Disk Size (GB): 1000
- Worker
  - Number of Workers: 2
  - Worker Cores: 4
  - Worker Memory (GB): 16
  - Worker Disk Size (GB): 1500
In the data pipeline, the driver and executor are as follows:
- Executor
  - CPU: 2
  - Memory: 4
- Driver
  - CPU: 2
  - Memory: 4
When I watch Dataproc in the Google Cloud Console, the cluster is provisioned and then disappears. Please share your opinion on how to solve this problem.
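It can help to total what the Dataproc settings above ask for, for example before comparing them with the project's Compute Engine quotas. Whether a quota is what stops provisioning here is an assumption; the post does not say. A minimal Python sketch, where `NodeGroup` and `totals` are hypothetical names used only for illustration:

```python
from dataclasses import dataclass


@dataclass
class NodeGroup:
    """One group of identical Dataproc nodes (hypothetical helper type)."""
    count: int
    cores: int
    memory_gb: int
    disk_gb: int


# Values copied from the settings listed in the question.
master = NodeGroup(count=1, cores=2, memory_gb=4, disk_gb=1000)
workers = NodeGroup(count=2, cores=4, memory_gb=16, disk_gb=1500)


def totals(groups):
    """Sum cores, memory, and disk over all node groups."""
    return {
        "cores": sum(g.count * g.cores for g in groups),
        "memory_gb": sum(g.count * g.memory_gb for g in groups),
        "disk_gb": sum(g.count * g.disk_gb for g in groups),
    }


cluster = totals([master, workers])
print(cluster)
```

With these settings the cluster asks for 10 cores, 36 GB of memory, and 4000 GB of disk in total.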

Related

Redisson does not recover after redis master fail over

We are using Redisson 3.17.0 with Redis 6.0.8. Redis runs in cluster mode with 3 masters, and each master has about 4-5 replicas. When a Redis master failover happens, Redisson starts throwing exceptions saying that it is unable to write the command into the connection. Even after the failover completes (about 30 s), the exceptions don't stop. Only restarting the instance that runs Redisson resolves the error. This affects the high availability of our service. We have pingConnectionInterval set to 5000 ms, and our read mode is masters only.
org.redisson.client.RedisTimeoutException: Command still hasn't been written into connection! Try to increase nettyThreads setting. Payload size in bytes: 81. Node source: NodeSource [slot=10354, addr=null, redisClient=null, redirect=null, entry=null], connection: RedisConnection#1578264320 [redisClient=[addr=rediss://-:6379], channel=[id: 0xb0f98c8c, L:/-:55678 - R:-/-:6379], currentCommand=null, usage=1], command: (EVAL), params: [local value = redis.call('hget', KEYS[1], ARGV[2]); after 2 retry attempts
Following is our redisson client config:
redisClientConfig: {
  endPoint: "rediss://$HOST_IP:6379"
  scanInterval: 1000
  masterConnectionPoolSize: 64
  masterConnectionMinimumIdleSize: 24
  sslEnableEndpointIdentification: false
  idleConnectionTimeout: 30000
  connectTimeout: 10000
  timeout: 3000
  retryAttempts: 2
  retryInterval: 300
  pingConnectionInterval: 5000
  keepAlive: true
  tcpNoDelay: true
  dnsMonitoringInterval: 5000
  threads: 16
  nettyThreads: 32
}
How can Redisson recover from these exceptions without restarting the application? We tried increasing nettyThreads and similar settings, but Redisson still does not recover after the failover.

Understanding Kubernetes cluster scaling

I am using AWS EKS with t3.medium instances, so each node has 2 vCPU (2000m CPU) and 4 GB RAM.
I am running 6 different apps with these CPU requests:
name    request   replicas   total-cpu
app#1   300m      x2         600m
app#2   100m      x4         400m
app#3   150m      x1         150m
app#4   300m      x1         300m
app#5   100m      x1         100m
app#6   150m      x1         150m
With basic math, all the apps together request 1700m of CPU. I also have an HPA with a 60% CPU target for app#1 and app#2. So I expect to have just one node, or maybe two (because of the kube-system pods), but the cluster always runs with 3 nodes. It looks like I misunderstood how autoscaling works.
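The sum above can be reproduced with a few lines of Python. This is a minimal sketch using the request table from the question. The 2000m figure is the raw t3.medium CPU quoted in the post; the amount actually allocatable to pods is somewhat lower, because part of each node is reserved for system components.

```python
# app -> (CPU request in millicores, replicas), copied from the table above
cpu_requests_m = {
    "app#1": (300, 2),
    "app#2": (100, 4),
    "app#3": (150, 1),
    "app#4": (300, 1),
    "app#5": (100, 1),
    "app#6": (150, 1),
}

NODE_CPU_M = 2000  # raw capacity of one t3.medium (2 vCPU), not allocatable

total_m = sum(request * replicas for request, replicas in cpu_requests_m.values())
print(f"total CPU requests: {total_m}m of {NODE_CPU_M}m per node")
```

CPU requests alone would fit on one node on paper, which is why the memory requests are worth checking too.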
$ kubectl top nodes
NAME                                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-*.eu-central-1.compute.internal   221m         11%    631Mi           18%
ip-*.eu-central-1.compute.internal   197m         10%    718Mi           21%
ip-*.eu-central-1.compute.internal   307m         15%    801Mi           23%
As you can see, only 10-15% of each node's CPU is in use. How can I optimize node scaling? Why do I have 3 nodes?
$ kubectl get hpa
NAME    REFERENCE                             TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
app#1   Deployment/easyinventory-deployment   37%/60%   1         5         3          5d16h
app#2   Deployment/poolinventory-deployment   64%/60%   1         5         4          4d10h
UPDATE #1
I have pod disruption budgets for the kube-system pods:
kubectl create poddisruptionbudget pdb-event --namespace=kube-system --selector k8s-app=event-exporter --max-unavailable 1
kubectl create poddisruptionbudget pdb-fluentd --namespace=kube-system --selector k8s-app=fluentd-gcp-scaler --max-unavailable 1
kubectl create poddisruptionbudget pdb-heapster --namespace=kube-system --selector k8s-app=heapster --max-unavailable 1
kubectl create poddisruptionbudget pdb-dns --namespace=kube-system --selector k8s-app=kube-dns --max-unavailable 1
kubectl create poddisruptionbudget pdb-dnsauto --namespace=kube-system --selector k8s-app=kube-dns-autoscaler --max-unavailable 1
kubectl create poddisruptionbudget pdb-glbc --namespace=kube-system --selector k8s-app=glbc --max-unavailable 1
kubectl create poddisruptionbudget pdb-metadata --namespace=kube-system --selector app=metadata-agent-cluster-level --max-unavailable 1
kubectl create poddisruptionbudget pdb-kubeproxy --namespace=kube-system --selector component=kube-proxy --max-unavailable 1
kubectl create poddisruptionbudget pdb-metrics --namespace=kube-system --selector k8s-app=metrics-server --max-unavailable 1
#source: https://gist.github.com/kenthua/fc06c6ea52a25a51bc07e70c8f781f8f
UPDATE #2
I figured out that the 3rd node is not always live: k8s scales down to 2 nodes, but after a few minutes it scales up to 3 nodes again, then later back down to 2, over and over.
$ kubectl describe nodes
# Node 1
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests       Limits
  --------                    --------       ------
  cpu                         1010m (52%)    1300m (67%)
  memory                      3040Mi (90%)   3940Mi (117%)
  ephemeral-storage           0 (0%)         0 (0%)
  attachable-volumes-aws-ebs  0              0
# Node 2
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests       Limits
  --------                    --------       ------
  cpu                         1060m (54%)    1850m (95%)
  memory                      3300Mi (98%)   4200Mi (125%)
  ephemeral-storage           0 (0%)         0 (0%)
  attachable-volumes-aws-ebs  0              0
UPDATE #3
I0608 11:03:21.965642 1 static_autoscaler.go:192] Starting main loop
I0608 11:03:21.965976 1 utils.go:590] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0608 11:03:21.965996 1 filter_out_schedulable.go:65] Filtering out schedulables
I0608 11:03:21.966120 1 filter_out_schedulable.go:130] 0 other pods marked as unschedulable can be scheduled.
I0608 11:03:21.966164 1 filter_out_schedulable.go:130] 0 other pods marked as unschedulable can be scheduled.
I0608 11:03:21.966175 1 filter_out_schedulable.go:90] No schedulable pods
I0608 11:03:21.966202 1 static_autoscaler.go:334] No unschedulable pods
I0608 11:03:21.966257 1 static_autoscaler.go:381] Calculating unneeded nodes
I0608 11:03:21.966336 1 scale_down.go:437] Scale-down calculation: ignoring 1 nodes unremovable in the last 5m0s
I0608 11:03:21.966359 1 scale_down.go:468] Node ip-*-93.eu-central-1.compute.internal - memory utilization 0.909449
I0608 11:03:21.966411 1 scale_down.go:472] Node ip-*-93.eu-central-1.compute.internal is not suitable for removal - memory utilization too big (0.909449)
I0608 11:03:21.966460 1 scale_down.go:468] Node ip-*-115.eu-central-1.compute.internal - memory utilization 0.987231
I0608 11:03:21.966469 1 scale_down.go:472] Node ip-*-115.eu-central-1.compute.internal is not suitable for removal - memory utilization too big (0.987231)
I0608 11:03:21.966551 1 static_autoscaler.go:440] Scale down status: unneededOnly=false lastScaleUpTime=2020-06-08 09:14:54.619088707 +0000 UTC m=+143849.361988520 lastScaleDownDeleteTime=2020-06-06 17:18:02.104469988 +0000 UTC m=+36.847369765 lastScaleDownFailTime=2020-06-06 17:18:02.104470075 +0000 UTC m=+36.847369849 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false
I0608 11:03:21.966578 1 static_autoscaler.go:453] Starting scale down
I0608 11:03:21.966667 1 scale_down.go:785] No candidates for scale down
UPDATE #4
According to the autoscaler logs, it was for some reason ignoring ip-*-145.eu-central-1.compute.internal as a scale-down candidate. I wondered what would happen, so I terminated the instance directly from the EC2 console. These lines then appeared in the autoscaler logs:
I0608 11:10:43.747445 1 scale_down.go:517] Finding additional 1 candidates for scale down.
I0608 11:10:43.747477 1 cluster.go:93] Fast evaluation: ip-*-145.eu-central-1.compute.internal for removal
I0608 11:10:43.747540 1 cluster.go:248] Evaluation ip-*-115.eu-central-1.compute.internal for default/app2-848db65964-9nr2m -> PodFitsResources predicate mismatch, reason: Insufficient memory,
I0608 11:10:43.747549 1 cluster.go:248] Evaluation ip-*-93.eu-central-1.compute.internal for default/app2-848db65964-9nr2m -> PodFitsResources predicate mismatch, reason: Insufficient memory,
I0608 11:10:43.747557 1 cluster.go:129] Fast evaluation: node ip-*-145.eu-central-1.compute.internal is not suitable for removal: failed to find place for default/app2-848db65964-9nr2m
I0608 11:10:43.747569 1 scale_down.go:554] 1 nodes found to be unremovable in simulation, will re-check them at 2020-06-08 11:15:43.746773707 +0000 UTC m=+151098.489673532
I0608 11:10:43.747596 1 static_autoscaler.go:440] Scale down status: unneededOnly=false lastScaleUpTime=2020-06-08 09:14:54.619088707 +0000 UTC m=+143849.361988520 lastScaleDownDeleteTime=2020-06-06 17:18:02.104469988 +0000 UTC m=+36.847369765 lastScaleDownFailTime=2020-06-06 17:18:02.104470075 +0000 UTC m=+36.847369849 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false
As far as I can see, the node is not scaled down because no other node can fit "app2". But the app's memory request is 700Mi, and at the moment the other nodes appear to have enough memory for app2:
$ kubectl top nodes
NAME                                          CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-0-93.eu-central-1.compute.internal    386m         20%    920Mi           27%
ip-10-0-1-115.eu-central-1.compute.internal   298m         15%    794Mi           23%
I still have no idea why the autoscaler does not move app2 to one of the other available nodes and scale down ip-*-145.
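This follows from how Pods with resource requests are scheduled.
A request is the amount of a resource guaranteed to the container, so the scheduler will not place a pod on a node that does not have enough unrequested capacity left. The scheduler and the autoscaler look at the sum of requests already on the node, not at the actual usage that kubectl top reports. In your case, the memory requests on the 2 existing nodes are already at about 91% and 99% of what the nodes can offer (0.909449 and 0.987231 in the autoscaler log). So ip-*-145 cannot be scaled down; otherwise app2 would have nowhere to go.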
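The fit check described above can be sketched in Python. This is a simplified illustration, not the scheduler's actual code. It assumes that the memory requests shown by `kubectl describe nodes` and the memory utilization values in the autoscaler log describe the same moment, and that utilization is requests divided by allocatable memory. The node labels and function names are hypothetical.

```python
POD_REQUEST_MI = 700  # memory request of app2, as stated in the question

# node -> (memory requests in Mi from `kubectl describe nodes`,
#          memory utilization from the autoscaler log)
# Pairing "Node 1"/"Node 2" with these log values is an assumption.
nodes = {
    "node-1": (3040, 0.909449),
    "node-2": (3300, 0.987231),
}


def free_memory_mi(requested_mi, utilization):
    """Estimate unrequested memory, taking utilization = requested / allocatable."""
    allocatable_mi = requested_mi / utilization
    return allocatable_mi - requested_mi


def fits(requested_mi, utilization, pod_request_mi):
    """True if the pod's request fits into the node's unrequested memory."""
    return free_memory_mi(requested_mi, utilization) >= pod_request_mi


for name, (requested, utilization) in nodes.items():
    free = free_memory_mi(requested, utilization)
    verdict = "fits" if fits(requested, utilization, POD_REQUEST_MI) else "does not fit"
    print(f"{name}: about {free:.0f}Mi unrequested, app2 {verdict}")
```

Under these assumptions, neither remaining node has 700Mi of unrequested memory, which matches the "Insufficient memory" predicate mismatch in the log even though actual usage is only around 25%.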

How to optimally set spark driver properties in YARN

I am trying out various options for setting spark driver memory in yarn.
Use Case:
I have a Spark cluster with 1 master and 2 slaves:
master : r5d.xlarge - 8 vcore, 32GB
slave : r5d.xlarge - 8 vcore, 32GB
I am using Apache Zeppelin to run queries on the Spark cluster. The Spark interpreter is configured with the properties provided by Zeppelin. I am using Spark 2.3.1 running on YARN. I want to create 4 interpreters so that 4 users can use this cluster in parallel.
Config 1:
spark.submit.deployMode client
spark.driver.cores 7
spark.driver.memory 24G
spark.driver.memoryOverhead 3072M
spark.executor.cores 1
spark.executor.memory 3G
spark.executor.memoryOverhead 512M
spark.yarn.am.cores 1
spark.yarn.am.memory 3G
spark.yarn.am.memoryOverhead 512M
Below is the spark executor UI:
Config 2:
spark.submit.deployMode client
spark.driver.cores 7
spark.driver.memory 12G
spark.driver.memoryOverhead 3072M
spark.executor.cores 1
spark.executor.memory 3G
spark.executor.memoryOverhead 512M
spark.yarn.am.cores 1
spark.yarn.am.memory 3G
spark.yarn.am.memoryOverhead 512M
Below is the spark executor UI:
Questions:
- Why is the container size of the driver 0?
- Is the memory governed by spark.memory.fraction calculated as (spark.driver.memory - 300 MB) * 0.6? If so, why is it not exact (14.22 and 7.02 respectively)?
- Why is the container size of the executor 3.8 GB? According to my configuration, it should be 3G + 512M = 3.5 GB. This issue did not occur with Spark 2.1.
- The number of vcores available to YARN is 8 per node. How is this possible, given the vCPU count AWS lists for this instance type? According to AWS I should only be getting 4 vcores.
  https://aws.amazon.com/ec2/instance-types/r5/
- If I want to use 4 interpreters, should I distribute the master's 32 GB equally among the interpreters?
  Driver:
  spark.driver.cores 2
  spark.driver.memory 7G
  spark.driver.memoryOverhead 1024M
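The arithmetic behind these questions can be checked with a short Python sketch. It assumes the default spark.memory.fraction of 0.6 and the 300 MB of reserved memory named in the question, and it treats the configured spark.driver.memory as the full JVM heap, which is a simplification. The function names are hypothetical.

```python
RESERVED_MB = 300      # reserved memory, as in the question's formula
MEMORY_FRACTION = 0.6  # default spark.memory.fraction (assumed unchanged)


def unified_memory_gib(heap_gib):
    """(heap - reserved) * fraction, converted from MB to GiB."""
    heap_mb = heap_gib * 1024
    return (heap_mb - RESERVED_MB) * MEMORY_FRACTION / 1024


def container_gib(memory_mb, overhead_mb):
    """Memory plus overhead, as requested for one container, in GiB."""
    return (memory_mb + overhead_mb) / 1024


config1 = unified_memory_gib(24)                 # Config 1: spark.driver.memory 24G
config2 = unified_memory_gib(12)                 # Config 2: spark.driver.memory 12G
executor = container_gib(3 * 1024, 512)          # 3G + 512M
per_interpreter = container_gib(7 * 1024, 1024)  # 7G + 1024M per driver

print(round(config1, 2), round(config2, 2))
print(executor)
print(4 * per_interpreter)
```

Under these assumptions the formula reproduces the 14.22 and 7.02 figures quoted in the question, the executor request comes to 3.5 GiB before any rounding by YARN, and four drivers at 7G + 1024M add up to exactly 32 GiB.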

Unable to create kafka topic

I am trying to create a Kafka topic on an EC2 instance.
I am following this documentation: https://aws.amazon.com/blogs/big-data/real-time-stream-processing-using-apache-spark-streaming-and-apache-kafka-on-aws/
but I am getting the following error. Please help.
[ec2-user@ip-10-100-53-218 bin]$ ./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
OpenJDK 64-Bit Server VM warning: If the number of processors is expected to increase from one, then you should configure the number of parallel GC threads appropriately using -XX:ParallelGCThreads=N
Error while executing topic command : replication factor: 1 larger than available brokers: 0
[2017-03-20 12:25:30,045] ERROR org.apache.kafka.common.errors.InvalidReplicationFactorException: replication factor: 1 larger than available brokers: 0
(kafka.admin.TopicCommand$)
The Kafka broker is not running. SSH into the Kafka broker instance and check whether kafka-server-start.sh is running:
ps -ef | grep kafka-server-start
If it is not running, start it:
nohup /app/kafka/kafka_2.9.2-0.8.2.1/bin/kafka-server-start.sh /app/kafka/kafka_2.9.2-0.8.2.1/config/server.properties
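As a complement to the ps check above, a short Python script can test whether anything is listening on the broker's port. This is a minimal sketch, assuming the broker uses Kafka's default port 9092 on localhost (the post does not show server.properties); `port_is_open` is a hypothetical helper. An open port only shows that some process is listening, not that the broker has registered with ZooKeeper.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, or host unreachable.
        return False


if __name__ == "__main__":
    # 9092 is an assumed port; adjust to the listener in server.properties.
    print("broker port open:", port_is_open("localhost", 9092))
```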

Spark EMR Cluster is removing executors when run because they are idle

I have a Spark application that was running fine in standalone mode. I'm now trying to run the same application on an AWS EMR cluster, but it is currently failing.
The message is one I've not seen before; it implies that the workers are not receiving jobs and are being shut down.
16/11/30 14:45:00 INFO ExecutorAllocationManager: Removing executor 3 because it has been idle for 60 seconds (new desired total will be 7)
16/11/30 14:45:00 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 2
16/11/30 14:45:00 INFO ExecutorAllocationManager: Removing executor 2 because it has been idle for 60 seconds (new desired total will be 6)
16/11/30 14:45:00 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 4
16/11/30 14:45:00 INFO ExecutorAllocationManager: Removing executor 4 because it has been idle for 60 seconds (new desired total will be 5)
16/11/30 14:45:01 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 7
16/11/30 14:45:01 INFO ExecutorAllocationManager: Removing executor 7 because it has been idle for 60 seconds (new desired total will be 4)
The DAG shows the workers being initialised, then a collect (a relatively small one), and shortly after that they all fail.
Dynamic allocation was enabled, so one thought was that the driver wasn't sending the executors any tasks and they timed out. To test that theory I spun up another cluster without dynamic allocation, and the same thing happened.
The master is set to yarn.
Any help is massively appreciated, thanks.
16/11/30 14:49:16 INFO BlockManagerMaster: Removal of executor 21 requested
16/11/30 14:49:16 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asked to remove non-existent executor 21
16/11/30 14:49:16 INFO BlockManagerMasterEndpoint: Trying to remove executor 21 from BlockManagerMaster.
16/11/30 14:49:24 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1480517110174_0001_01_000049 on host: ip-10-138-114-125.ec2.internal. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1480517110174_0001_01_000049
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
My step is quite simple: spark-submit --deploy-mode client --master yarn --class Run app.jar