I have an Azure Data Factory pipeline that contains 24 parallel jobs (Databricks notebooks).
In the pipeline configuration I set concurrency to 4, but when I run the pipeline all 24 jobs start in parallel, although I want only 4 of them to be running at any one time.
According to the documentation here, I should be able to run 4 parallel jobs while the others stay in a queued status.
Am I missing another parameter to configure?
Thank you in advance :)
The link you are following is for ADF v1; it clearly mentions that at the top of the doc. The feature you are trying to use is for ADF v2.
"[!NOTE] This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see pipeline execution and triggers article."
I am trying to build a multi-node parallel job in AWS Batch running an R script. My R script independently runs multiple statistical models for multiple users. Hence, I want to split and distribute this job to run in parallel on a cluster of several servers for faster execution. My understanding is that at some point I have to prepare a containerized version of my R application code using a Dockerfile pushed to ECR. My questions are:
Should the parallel logic be placed inside the R code, while using one Dockerfile? If yes, how does Batch know how to split my job (into how many chunks)? Is a for-loop in the R code enough?
Or should I define the parallel logic somewhere in the Dockerfile, saying that container1 runs the models for users 1-5, container2 runs the models for users 6-10, etc.?
Could you please share some ideas or code on that topic for better understanding? Much appreciated.
AWS Batch does not inspect or change anything in your container; it just runs it. So you would need to handle the distribution of the work within the container itself.
Since these are independent processes (they don't communicate with each other over MPI, etc.), you can leverage AWS Batch array jobs. Batch multi-node parallel (MNP) jobs are for tightly coupled workloads that need inter-instance or inter-GPU communication using Elastic Fabric Adapter.
Your application code in the container can leverage the AWS_BATCH_JOB_ARRAY_INDEX environment variable to process a subset of users. Note that AWS_BATCH_JOB_ARRAY_INDEX is zero-based, so you will need to account for that if your user numbering/naming scheme is different.
You can see an example in the AWS Batch docs for how to use the index; a small sketch is also shown below.
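As a rough sketch of the idea (shown in Python for brevity; in R you would read the same variable with Sys.getenv). The user list, chunk size, and run_models_for function are placeholders, not part of any AWS API:

import os

def run_models_for(user):
    # Placeholder for fitting the statistical models of a single user.
    print(f"running models for {user}")

# AWS Batch sets this variable in every child job of an array job (0..N-1).
index = int(os.environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))

# Hypothetical user list; in practice you would load it from a file or database.
users = [f"user{i}" for i in range(1, 101)]

chunk_size = 5  # assumed: 5 users per child job, so submit the array job with size 20
for user in users[index * chunk_size:(index + 1) * chunk_size]:
    run_models_for(user)

With this layout you keep a single Dockerfile and a single container image; the only thing that differs between the array children is the index that AWS Batch injects.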
Currently, the following steps happen to my data:
New objects in a GCS bucket trigger a Google Cloud Function that creates a BigQuery job to load this data into BigQuery.
I need a low-cost solution to know when this BigQuery job is finished, and trigger a Dataflow pipeline only after the job is completed.
Notes:
I know about the BigQuery alpha trigger for Google Cloud Functions, but I don't know if it is a good idea; from what I saw, this trigger uses the job ID, which apparently cannot be fixed in advance, so I would have to redeploy the function every time a job runs. And of course it's an alpha solution.
I read about a Stackdriver Logging -> Pub/Sub -> Google Cloud Function -> Dataflow solution, but I didn't find any log entry that indicates that the job finished.
My files are large, so it isn't a good idea to have a Google Cloud Function wait until the job finishes.
Despite your remark about Stackdriver Logging, you can use it with this filter:
resource.type="bigquery_resource"
protoPayload.serviceData.jobCompletedEvent.job.jobStatus.state="DONE"
severity="INFO"
You can add a dataset filter in addition if needed.
Then create a sink on this advanced filter (for example to a Pub/Sub topic that triggers a Cloud Function) and run your Dataflow job from there; a sketch is shown below.
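A minimal sketch of such a Pub/Sub-triggered function in Python, assuming a pre-built Dataflow template; the project ID, region, and template path are placeholders, and the exact payload path may differ depending on the audit-log format:

import json
from base64 import b64decode
from googleapiclient.discovery import build  # google-api-python-client

PROJECT = "my-project"                             # assumption: your project ID
TEMPLATE = "gs://my-bucket/templates/my-template"  # assumption: a pre-built Dataflow template

def trigger_dataflow(event, context):
    # The Pub/Sub message carries the exported LogEntry of the completed load job.
    entry = json.loads(b64decode(event["data"]).decode("utf-8"))
    job_id = entry["protoPayload"]["serviceData"]["jobCompletedEvent"]["job"]["jobName"]["jobId"]

    dataflow = build("dataflow", "v1b3", cache_discovery=False)
    dataflow.projects().locations().templates().launch(
        projectId=PROJECT,
        location="us-central1",
        gcsPath=TEMPLATE,
        body={"jobName": f"after-bq-{job_id}".lower().replace("_", "-")},
    ).execute()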
If this doesn't match your expectation, can you detail why?
You can look at Cloud Composer, which is managed Apache Airflow, for orchestrating jobs in a sequential fashion. Composer executes each node of a DAG and checks the dependencies you have defined, so that tasks run either in parallel or sequentially based on those conditions.
You can take a look at the example mentioned here - https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/cloud-composer-examples/composer_dataflow_examples
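As a rough illustration of what such a DAG could look like (operator names are from the Airflow Google provider package; the bucket, table, and template paths below are placeholders):

from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator

with DAG(
    dag_id="load_then_dataflow",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # e.g. triggered per file from a GCS-triggered function
    catchup=False,
) as dag:
    # Load the new GCS objects into BigQuery; Airflow waits for the load job to finish.
    load = GCSToBigQueryOperator(
        task_id="load_to_bq",
        bucket="my-bucket",                                                   # placeholder
        source_objects=["incoming/*.csv"],                                    # placeholder
        destination_project_dataset_table="my_project.my_dataset.my_table",   # placeholder
        write_disposition="WRITE_APPEND",
    )

    # Starts only after the load task has succeeded.
    run_dataflow = DataflowTemplatedJobStartOperator(
        task_id="run_dataflow",
        job_name="post-load-pipeline",
        template="gs://my-bucket/templates/my-template",  # placeholder
        location="us-central1",
    )

    load >> run_dataflow

Because the Dataflow task is downstream of the load task, it only starts once the BigQuery load has completed successfully, which is exactly the ordering being asked for.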
In a pipeline we have 3 projects being deployed. In the first stage we retrieve all the project sources, and in subsequent stages we deploy and run tests on each project; that makes 4 stages in total, 1 for getting the sources and 1 per project for deployment, tests and other actions. Our release changes are triggered by any commit made to any of the projects in the pipeline.
Normally this works OK, but apparently CodePipeline doesn't queue release changes and can trigger one after another if a commit is made while a release change is already running, so they run in parallel on the same EC2 instance and subsequently generate errors. Is there a way to configure a queue for CodePipeline release changes, setting aside the option of manual approvals?
Thanks for the help in advance.
Based on your description it sounds like you have three projects in one pipeline with a stage for each project and one EC2 instance.
Why not create an independent pipeline for each project? Otherwise, it sounds like you need mutual exclusion across the project stages. You could combine the three stages into one and let CodePipeline enforce that only one pipeline execution at a time occupies a stage.
I should probably mention, based on your question, that CodePipeline is intended for continuous delivery, where it's desirable to have multiple changes moving through the pipeline at the same time. This is more obvious with deep pipelines (i.e. if it takes 3 days to fully release a change, you probably don't want to wait 3 days before a new change can start traversing the pipeline).
We are experimenting with Hadoop and processing the Common Crawl.
Our problem is that if we create a cluster with 1 master node, 1 core node, and 2 task nodes, only one node per instance group gets high CPU/network usage.
We also tried 2 core nodes and no task nodes, but in that case too, only one core node was used.
Below are some screenshots of the node/cluster monitoring. The job was running the whole time (in the first two parallel map phases) and should have used most of the available CPU power, as you can see in the screenshot of the working task node.
But why is the idle task node not utilized?
Our Hadoop job, running as a JAR step, has no limit on the number of map tasks. It consists of multiple chained map/reduce steps; only the last reduce step is limited to one reducer.
Screenshots:
https://drive.google.com/drive/folders/1xwABYJMJAC_B0OuVpTQ9LNSj12TtbxI1?usp=sharing
ClusterId: j-3KAPYQ6UG9LU6
StepId: s-2LY748QDLFLM9
We found the following in the system logs of the idle node during another run; maybe it is an EMR problem?
ERROR main: Failed to fetch extraInstanceData from https://aws157-instance-data-1-prod-us-east-1.s3.amazonaws.com/j-2S62KOVL68GVK/ig-3QUKQSH7YJIAU.json?X-Amz-Algorithm=AWS4-HMAC-SHA256&X
Greetings
Lukas
Late to the party, but have you tried setting these properties as part of the spark-submit command?
--conf 'spark.dynamicAllocation.enabled=true'
--conf 'spark.dynamicAllocation.minExecutors=<MIN_NO_OF_CORE_OR_TASK_NODES_YOU_WANT>'
I created a data service project and enabled boxcarring for running 5 queries sequentially.
After deploying the service, I need to use a scheduled task to run it every 5 minutes. In the scheduled task I selected the _request_box operation (it was created by DSS boxcarring), but it doesn't work. How can I use a scheduled task with boxcarring?
Thank you
When a task is scheduled, the operation should be a parameter-less operation. Since request_box consists of several other operations, this scenario will not work like a normal operation. I have added a JIRA issue to report this scenario, and you can track the progress from there.