Capacity-Scheduler Not Running multiple Jobs submitted in different queues - amazon-web-services

We are new to the Capacity Scheduler. We are spinning up an AWS EMR cluster where we want to add a Capacity Scheduler configuration so that jobs run simultaneously in different queues.
The issue is that even though we manage to create a stable cluster with the scheduler config, we are unable to submit jobs in parallel to each queue.
Referring to the link below, we created the configuration by providing the respective values.
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
With all the parameters added as given in the above link, the cluster fails while bootstrapping. The configuration we are currently using is:
yarn.scheduler.capacity.maximum-am-resource-percent=0.2
yarn.scheduler.capacity.maximum-applications=10000
yarn.scheduler.capacity.node-locality-delay=40
yarn.scheduler.capacity.root.accessible-node-labels=*
yarn.scheduler.capacity.root.acl_administer_queue=yarn
yarn.scheduler.capacity.root.capacity=100
yarn.scheduler.capacity.root.default.acl_submit_applications=yarn
yarn.scheduler.capacity.root.default.capacity=50
yarn.scheduler.capacity.root.default.maximum-capacity=100
yarn.scheduler.capacity.root.default.state=RUNNING
yarn.scheduler.capacity.root.default.user-limit-factor=2
yarn.scheduler.capacity.root.queues=bt,default,opt
yarn.scheduler.capacity.queue-mappings-override.enable=false
yarn.scheduler.capacity.root.bt.acl_administer_queue=*
yarn.scheduler.capacity.root.bt.acl_submit_applications=*
yarn.scheduler.capacity.root.bt.capacity=25
yarn.scheduler.capacity.root.bt.maximum-capacity=100
yarn.scheduler.capacity.root.bt.minimum-user-limit-percent=100
yarn.scheduler.capacity.root.bt.ordering-policy=fair
yarn.scheduler.capacity.root.bt.ordering-policy.fair.enable-size-based-weight=false
yarn.scheduler.capacity.root.bt.priority=0
yarn.scheduler.capacity.root.bt.state=RUNNING
yarn.scheduler.capacity.root.bt.user-limit-factor=1
yarn.scheduler.capacity.root.default.acl_administer_queue=yarn
yarn.scheduler.capacity.root.default.minimum-user-limit-percent=25
yarn.scheduler.capacity.root.default.ordering-policy=fair
yarn.scheduler.capacity.root.default.ordering-policy.fair.enable-size-based-weight=false
yarn.scheduler.capacity.root.default.priority=0
yarn.scheduler.capacity.root.opt.acl_administer_queue=*
yarn.scheduler.capacity.root.opt.acl_submit_applications=*
yarn.scheduler.capacity.root.opt.capacity=25
yarn.scheduler.capacity.root.opt.maximum-capacity=25
yarn.scheduler.capacity.root.opt.minimum-user-limit-percent=100
yarn.scheduler.capacity.root.opt.ordering-policy=fair
yarn.scheduler.capacity.root.opt.ordering-policy.fair.enable-size-based-weight=false
yarn.scheduler.capacity.root.opt.priority=0
yarn.scheduler.capacity.root.opt.state=RUNNING
yarn.scheduler.capacity.root.opt.user-limit-factor=1
yarn.scheduler.capacity.root.priority=0
Somehow, with the above configuration, we are able to create a cluster, but we are facing these issues:
1. A job runs perfectly fine in the default queue, but if submitted to the other queues it gets stuck in the ACCEPTED state.
2. Only one job is submitted at a time, and the other jobs keep waiting as EMR steps instead of running in different queues.
P.S.: The jobs we submit to the EMR cluster are Spark jobs triggered from a Lambda function.
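For reference, here is a minimal sketch of how such a step could be submitted from Lambda into a named Capacity Scheduler queue; the region, cluster id, script location, and concurrency value are placeholders, and StepConcurrencyLevel only applies on EMR releases that support concurrent steps. Whether jobs in different queues actually run at the same time still depends on the queue capacities and maximum-am-resource-percent above, so treat this purely as an illustration of the submission path.

# Hypothetical sketch: submit a Spark step from Lambda into a specific YARN queue.
# The region, cluster id, script location, and queue name are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

CLUSTER_ID = "j-XXXXXXXXXXXXX"  # placeholder EMR cluster id

def lambda_handler(event, context):
    # Allow several steps to run concurrently (supported on newer EMR releases).
    emr.modify_cluster(ClusterId=CLUSTER_ID, StepConcurrencyLevel=5)

    step = {
        "Name": "spark-job-bt-queue",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "--queue", "bt",                  # target Capacity Scheduler queue
                "s3://my-bucket/jobs/my_job.py",  # placeholder script location
            ],
        },
    }
    response = emr.add_job_flow_steps(JobFlowId=CLUSTER_ID, Steps=[step])
    return response["StepIds"]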

Related

How should I pull from Pub/Sub using Compute Engine MIGs

In my case, Pub/Sub pushes to a Python service on Cloud Functions are unfeasible due to the short timeout. So the idea of a container-based managed instance group of Compute Engine instances sounds good: these instances can scale up/down based on the Pub/Sub pending task count metric. The machines' containers would run Python code on startup, and that code would pull from Pub/Sub and process each pulled job accordingly.
Context aside, the question is: is this a good idea? Are there any gotchas? Since there would be several machines at scale, how could I guarantee that a given 'queued task' would not be picked up and have its processing started on more than one of these machines? I know about ACKs, but ACKs should only be emitted when the task ends successfully, shouldn't they? What strategy should I use to prevent these and other problems?
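For what it's worth, here is a minimal sketch of the pull side, assuming the google-cloud-pubsub Python client; the project and subscription names are placeholders and process_task is a hypothetical stand-in for the real job logic.

# Sketch of a streaming-pull worker that acks only after successful processing.
# Project/subscription ids and process_task are placeholders.
from google.cloud import pubsub_v1

PROJECT_ID = "my-project"          # placeholder
SUBSCRIPTION_ID = "my-tasks-sub"   # placeholder

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)

def process_task(payload: bytes) -> None:
    # Hypothetical: replace with the real job processing.
    pass

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    try:
        process_task(message.data)
        message.ack()   # ack only once the task has finished successfully
    except Exception:
        message.nack()  # let Pub/Sub redeliver, possibly to another instance

# Limit how many messages a single instance holds at once.
flow_control = pubsub_v1.types.FlowControl(max_messages=10)
future = subscriber.subscribe(subscription_path, callback=callback, flow_control=flow_control)

try:
    future.result()  # block the container's main thread
except KeyboardInterrupt:
    future.cancel()

On the duplication question: Pub/Sub delivery is at-least-once, so a message is redelivered only if it isn't acked before its ack deadline, but occasional duplicates are still possible; the processing therefore needs to be idempotent (for example, keyed on the message id) rather than relying on exactly one machine ever seeing it.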

Why is the cluster not provisioned when running a data pipeline in Data Fusion?

I am using DataFusion Enterprise.
Under Data Fusion > System Admin > Configuration > System Compute Profiles > Create New Profile,
I set the configuration values for the master node and the worker nodes.
I also set the configuration for each data pipeline (executor, driver).
Now, when deploying and running the data pipeline, the provisioning state does not move on to the next startup phase.
The issue is as follows.
1. Dataproc CreateCluster
asia-northeast3:cdap-eventgmkt-7e769e35-182d-11b-9d9d-ce8dcdf883
service-125051400193@gcp-sa-datafusion.iam.gserviceaccount.com
Multiple Errors: - Timeout waiting for instance cdap-eventgmkt-7e769e35-182d-11b-9d9d-ce8dcdf88803-m to report in. - Timeout waiting for instance cdap-eventgmt-7769-9d-1d-1d1
2. Dataproc DeleteCluster
asia-northeast3:cdap-eventgmkt-7e769e35-182d-11b-9d9d-ce8dcdf883
service-125051400193@gcp-sa-datafusion.iam.gserviceaccount.com
Cannot delete cluster 'cdap-eventgmkt-7e769e35-182d-11eb-9d9d-ce8dcdf88803' when it has other pending delete operations.
In short, the data pipeline run has started, but provisioning never completes.
How do you solve these problems?
Thank you.

Which queue will be used while running this job in Sidekiq?

Hi, I am using sidekiq-cron in my Rails project. The job classes inherit from ActiveJob. In my job file I have queue_as :default, and in the schedule.yml file I have queue: high_priority. Which queue will actually be used?
Are you talking about sidekiq-cron? It seems that it allows you to pick the queue, so you should expect the high_priority queue to be used when the jobs are scheduled by cron, and the default queue when the same jobs are scheduled by other means (other Ruby code that is not the cron).
You have a way to confirm this: spin up a local Sidekiq that does not process jobs from either of these queues, set your cron to every minute, and in the Sidekiq web dashboard you will see jobs accumulating on the Queues tab.

Cloud composer tasks fail without reason or logs

I run Airflow in a managed Cloud Composer environment (version 1.9.0), which runs on a Kubernetes 1.10.9-gke.5 cluster.
All my DAGs run daily at 3:00 AM or 4:00 AM, but some mornings I see that a few tasks failed without a reason during the night.
When checking the logs in the UI I see no log, and there is no log in the log folder in the GCS bucket either.
In the instance details, it reads "Dependencies Blocking Task From Getting Scheduled" but the dependency is the dagrun itself.
Although the DAG is set with 5 retries and an email on failure, it does not look as if any retry took place, and I haven't received an email about the failure.
I usually just clear the task instance and it runs successfully on the first try.
Has anyone encountered a similar problem?
Empty logs often mean the Airflow worker pod was evicted (i.e., it died before it could flush its logs to GCS), which is usually due to an out-of-memory condition. If you go to your GKE cluster (the one under Composer's hood), you will probably see that there is indeed an evicted pod (GKE > Workloads > "airflow-worker").
You will probably also see in "Task Instances" that the affected tasks have no start date, job id, or worker (hostname) assigned, which, together with the missing logs, is evidence that the pod died.
Since this normally happens in highly parallelised DAGs, a way to avoid this is to reduce the worker concurrency or use a better machine.
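Purely as an illustration of the "reduce concurrency" route, here is a sketch of capping parallelism at the DAG level with Airflow 1.x arguments; the DAG id, schedule, and values are placeholders.

# Hypothetical sketch: limit how many task instances of this DAG run at once,
# so a single worker pod is less likely to be evicted for running out of memory.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    dag_id="nightly_example",        # placeholder
    start_date=datetime(2019, 1, 1),
    schedule_interval="0 3 * * *",   # 3:00 AM, as in the question
    concurrency=4,                   # max concurrent task instances for this DAG
    max_active_runs=1,               # one active DAG run at a time
    default_args={"retries": 5},     # retries, as already configured
)

start = DummyOperator(task_id="start", dag=dag)

Worker concurrency itself can also be lowered through Composer's Airflow configuration overrides (the celery worker_concurrency setting), if that fits the workload better.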
EDIT: I filed this Feature Request on your behalf to get emails in case of failure, even if the pod was evicted.

How is it that a mapreduce pipeline can run longer than 10 minutes?

MapReduce tasks run within a parent pipeline, and of course we all know they can run for a very long time. But at the same time, the Pipeline API documentation says that a pipeline must complete within 10 minutes (https://github.com/GoogleCloudPlatform/appengine-pipelines/wiki/Python). What is the proper way to understand this?
Thanks.
That pipeline documentation is really old... when it was written, tasks were limited to 10 minutes. Now you can configure a non-default module (what used to be called a "backend") with basic/manual scaling, which will allow a task to run for 24 hours:
https://cloud.google.com/appengine/docs/python/modules/#Python_Instance_scaling_and_class
(NOTE: if you run a task on an auto-scaled module, it will still be limited to 10-mins)
The entire pipeline isn't limited to 24 hours, though. The "root" pipeline (the first task that runs) can yield many child pipelines, and each of those can further yield other pipelines... each pipeline is a task that has to run within the allotted time (10 minutes or 24 hours)... when it is done, it signals the parent to wake up and finish... so the overall pipeline could run for days or months or whatever.
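To make that fan-out concrete, here is a rough sketch using the Python Pipeline API; ShardWorkPipeline and the shard count are made up for illustration, and the exact import path depends on how the library is vendored into the app.

# Rough sketch: a root pipeline yielding child pipelines, each of which runs
# as its own task subject to its module's deadline (10 minutes or 24 hours).
from pipeline import pipeline  # appengine-pipelines; import path may vary

class ShardWorkPipeline(pipeline.Pipeline):
    def run(self, shard_id):
        # Do one shard's worth of work within this task's time limit.
        return shard_id

class RootPipeline(pipeline.Pipeline):
    def run(self, num_shards):
        for shard_id in range(num_shards):
            # Each yield becomes a separate task; the parent is woken up
            # again once all the children have completed.
            yield ShardWorkPipeline(shard_id)

# Started, e.g., from a handler on the basic-scaled module:
# job = RootPipeline(16)
# job.start()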
We have our app split into two modules, one for the front-end (default, auto-scaled) that handles web requests, and one for the "back end" (basic scaling) that runs all of our tasks