Excluding specific NodeManagers while running a job - MapReduce

I am using hadoop-2.7.3. How can I exclude specific NodeManagers for only some jobs? I tried some configurations, but they applied at the ResourceManager level, not at the NodeManager level. Is it possible to achieve this with a MapReduce job-level property?
Example: I want to run job1 on NodeManager group1 (5 servers) and job2 on NodeManager group2 (the other 5 servers).

You can exclude nodes if you're using Capacity Scheduler. The feature is called Node labels. Refer to YARN Node Labels for more information.

Related

How do I see monitoring jobs in the GCP console?

I have created a monitoring job using create_model_deployment_monitoring_job. How do I view it in GCP Monitoring?
I create the monitoring job thus:
job = vertex_ai_beta.ModelDeploymentMonitoringJob(
    display_name=MONITORING_JOB_NAME,
    endpoint=endpoint_uri,
    model_deployment_monitoring_objective_configs=deployment_objective_configs,
    logging_sampling_strategy=sampling_config,
    model_deployment_monitoring_schedule_config=schedule_config,
    model_monitoring_alert_config=alerting_config,
)
response = job_client_beta.create_model_deployment_monitoring_job(
    parent=PARENT, model_deployment_monitoring_job=job
)
AI Platform Training supports two types of jobs: training and batch prediction. The details for each are different, but the basic operation is the same.
Since you are using Vertex AI, you can check the job status in the Vertex AI dashboard. In the GCP Console, search for Vertex AI and enable the API (or click on this link), then follow this doc for the job status.
This link summarizes the job operations and lists the interfaces you can use to perform them; to learn more about jobs, follow this link.
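You can also confirm the job was created and check its state programmatically. A minimal sketch, assuming job_client_beta and PARENT are the same JobServiceClient instance and "projects/{project}/locations/{region}" string used in your question:

# List every model deployment monitoring job under the project/region
# and print its resource name, display name and current state.
for monitoring_job in job_client_beta.list_model_deployment_monitoring_jobs(parent=PARENT):
    print(monitoring_job.name, monitoring_job.display_name, monitoring_job.state)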

How to get the "templateLocation" parameter value of an existing job in Google Dataflow

I have a list of existing jobs running in Google Dataflow. I would like to list the jobs that have been running for the last x number of days and recycle them programmatically. To achieve this, I need the name of the template used for a particular job. We can easily get this information from the Console in the Job Info view. However, I would like to know if there is any way to get this information from a gcloud command or from an API.
Your early response will be appreciated.
Thanks
Sarang
- Solution 1:
You can use the gcloud SDK and a shell script to achieve this:
https://cloud.google.com/sdk/gcloud/reference/dataflow/jobs/list
Filter jobs with the given name:
gcloud dataflow jobs list --filter="name=my-wordcount"
List jobs from a given region:
gcloud dataflow jobs list --region="europe-west1"
List jobs created this year:
gcloud dataflow jobs list --created-after=2018-01-01
List jobs created more than a week ago:
gcloud dataflow jobs list --created-before=-P1W
Many filters and parameters are available to fit your use case.
- Solution 2:
You can use the REST API for Dataflow jobs:
https://cloud.google.com/dataflow/docs/reference/rest
Example:
GET /v1b3/projects/{projectId}/locations/{location}/jobs
List the jobs of a project.
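As a rough sketch of calling this endpoint from Python to list the jobs created in the last x days, assuming the google-api-python-client library, Application Default Credentials, and placeholder project, location and day values that you would replace with your own:

from datetime import datetime, timedelta, timezone
from googleapiclient.discovery import build

PROJECT_ID = "my-project"   # placeholder
LOCATION = "europe-west1"   # placeholder
DAYS = 7                    # "last x number of days"

dataflow = build("dataflow", "v1b3")

# RFC 3339 UTC timestamps sort lexicographically, so a plain string
# comparison against the cutoff is enough here.
cutoff = (datetime.now(timezone.utc) - timedelta(days=DAYS)).strftime("%Y-%m-%dT%H:%M:%SZ")

response = dataflow.projects().locations().jobs().list(
    projectId=PROJECT_ID, location=LOCATION).execute()

for job in response.get("jobs", []):
    if job["createTime"] >= cutoff:
        print(job["id"], job["name"], job["currentState"])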
There is no direct way of getting the template location and name. To meet the above requirement, I defined a naming pattern for the template name and job name; based on that pattern, the template name is computed while recycling the job and passed on to the API call.
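For illustration only, that kind of convention could look like the following sketch; the "--" delimiter and name format are entirely hypothetical and have to match whatever naming scheme you adopt:

# Hypothetical convention: job names look like "<template-name>--<timestamp>",
# so the template name can be recovered by splitting on the delimiter.
def template_name_from_job_name(job_name):
    return job_name.split("--", 1)[0]

print(template_name_from_job_name("my-wordcount--20240101-1200"))  # -> my-wordcount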

Trigger an event when 3 specific AWS jobs are completed

I have submitted 3 jobs in parallel in AWS Batch and I want to create a trigger for when all 3 of these jobs are completed.
Something like: I should be able to specify the 3 job IDs and update the DB once all 3 jobs are done.
I can do this easily with long polling, but I wanted to do something event-based.
I need your help with this.
The easiest option would be to create a fourth Batch job that specifies the other three jobs as dependencies. This job will sit in the PENDING state until the other three jobs have succeeded, and then it will run. Inside that job, you could update the DB or do whatever other actions you wanted.
One downside to this approach is that if one of the jobs fails, the pending job will automatically go into a FAILED state without running.
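A minimal sketch of submitting such a dependent job with boto3; the queue name, job definition, job name and job IDs below are placeholders:

import boto3

batch = boto3.client("batch")

# IDs returned when the three original jobs were submitted (placeholders here).
upstream_job_ids = ["job-id-1", "job-id-2", "job-id-3"]

# Submit a fourth job that only starts once all three dependencies have succeeded.
response = batch.submit_job(
    jobName="update-db-after-batch",       # hypothetical name
    jobQueue="my-job-queue",               # placeholder queue
    jobDefinition="my-update-db-job-def",  # placeholder job definition
    dependsOn=[{"jobId": job_id} for job_id in upstream_job_ids],
)
print(response["jobId"])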

Rundeck job with Ansible's dynamic inventory (ec2.py)

How do I configure Rundeck so that I can execute a job through Ansible over a couple of AWS EC2 instances? I am using the Batix plugin, but I believe it is not configured properly or some configuration of mine is missing.
My idea is to trigger a job from Rundeck without defining static inventories on Rundeck or Ansible, if possible. (I should add that Ansible + ec2.py and ec2.ini work properly without Rundeck.)
Below is a snippet of my configuration file for the inventory settings.
project.ansible-generate-inventory=true
resources.source.1.config.ansible-gather-facts=true
resources.source.1.config.ansible-ignore-errors=true
resources.source.1.config.ansible-inventory=/{{ VAR }}
resources.source.1.type=com.batix.rundeck.plugins.AnsibleResourceModelSourceFactory
for VAR I tried these values = etc/ansible/hosts ..... /ec2.py ..... /ec2.py -- list ..... /tmp/data/inventory
You can use a dynamic inventory under Rundeck; take a look at this GitHub thread. Another way is to create a node source like this. Alternatively, you can use the Rundeck EC2 plugin to get the AWS EC2 nodes directly. Take a look at this.

Dataproc client : googleapiclient : method to get a list of all jobs (running, stopped, etc.) in a cluster

We are using Google Cloud Dataproc to run Spark jobs.
We have a requirement to get a list of all jobs and their states for a given cluster.
I can get the status of a job if I know the job_id, as below:
res = dpclient.dataproc.projects().regions().jobs().get(
    projectId=project,
    region=region,
    jobId="ab4f5d05-e890-4ff5-96ef-017df2b5c0bc").execute()
But what if I don't know the job_id and want to know the status of all the jobs?
To list jobs in a cluster, you can use the list() method:
clusterName = 'cluster-1'
res = dpclient.dataproc.projects().regions().jobs().list(
    projectId=project,
    region=region,
    clusterName=clusterName).execute()
However, note that this only currently supports listing by clusters which still exist; even though you pass in a clusterName, this is resolved to a unique cluster_uuid under the hood; this also means if you create multiple clusters of the same name, each incarnation is still considered a different cluster, so job listing is only performed on the currently running version of the clusterName. This is by design, since clusterName is often reused by people for different purposes (especially if using the default generated names created in cloud.google.com/console), and logically the jobs submitted to different actual cluster instances may not be related to each other.
In the future there will be more filter options for job listings.
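If a cluster has accumulated more jobs than fit in a single response page, the listing is paginated. A minimal sketch of walking all pages and printing each job's state, reusing the dpclient, project, region and clusterName variables from the snippet above:

jobs = []
request = dpclient.dataproc.projects().regions().jobs().list(
    projectId=project, region=region, clusterName=clusterName)
while request is not None:
    response = request.execute()
    jobs.extend(response.get("jobs", []))
    # list_next() returns None when there are no more pages.
    request = dpclient.dataproc.projects().regions().jobs().list_next(
        previous_request=request, previous_response=response)

for job in jobs:
    print(job["reference"]["jobId"], job["status"]["state"])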