To optimize a pipeline, it's important to know the rate-limiting rules or paths. Based on the DAG analysis, are there any approaches that could easily calculate the critical path or identify the key events?
This is a broad question, but you may be interested in the --runtime-profile command-line argument, which can be used to profile Snakemake code using yappi.
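As far as I know there's nothing built in that reports the critical path directly, but if you can export the job DAG (for example by parsing the dot output of snakemake --dag) and attach a runtime estimate to each job (from benchmark files or the profiler output), the critical path is just the longest weighted path through the acyclic graph, which networkx can compute. A rough sketch with made-up jobs and runtimes:

```python
# Sketch: compute the critical (longest-duration) path of an acyclic job DAG.
# Assumes you have exported the DAG edges yourself (e.g. parsed from the dot
# output of `snakemake --dag`) and collected per-job runtime estimates (e.g.
# from benchmark files); neither is provided by Snakemake in this exact form.
import networkx as nx

# Hypothetical inputs: edges point from an upstream job to the job that consumes it.
edges = [("download", "trim"), ("trim", "align"), ("align", "call_variants"),
         ("download", "qc"), ("qc", "call_variants")]
runtime = {"download": 120, "trim": 300, "align": 2400, "qc": 90, "call_variants": 600}

dag = nx.DiGraph(edges)
assert nx.is_directed_acyclic_graph(dag)

# Weight each edge by the runtime of the job it leads into, and seed each
# source node with its own runtime via a virtual start node.
weighted = nx.DiGraph()
for u, v in dag.edges:
    weighted.add_edge(u, v, weight=runtime[v])
for node in dag.nodes:
    if dag.in_degree(node) == 0:
        weighted.add_edge("__start__", node, weight=runtime[node])

critical_path = nx.dag_longest_path(weighted, weight="weight")[1:]  # drop __start__
print("critical path:", " -> ".join(critical_path))
print("lower bound on total runtime:",
      sum(runtime[job] for job in critical_path), "seconds")
```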
I am writing a Step Function in AWS CDK that runs two tasks in parallel. From one of the tasks, I would like to access a value from the second task that runs in parallel (for example, in task 1 I would like to know the start time of task 2, or maybe the ID of task 2).
Here is a screenshot of the state machine definition in Step Functions.
In the example shown in the screenshot, I would like to use the Id of GlueStartRunJob (1) in GlueStartRunJob.
I was thinking about using the Context Object for that purpose. Nevertheless, I am not sure if this is the right approach...
The Context Object is read-only and allows a given state to access contextual information about itself, not about other states elsewhere in the workflow.
I'm not 100% clear what you are aiming to accomplish here, but I can see a couple of possible approaches.
First, you might just want to order these Glue Jobs to run sequentially so the output from the first can be used in the second.
Second, if you need the workflow to take action after the Glue Jobs have started but before they have completed, you'd need an approach that does not use the .sync integration pattern. With that integration pattern, Step Functions puts a synchronous facade over an asynchronous interaction, taking care of the steps to track completion and return you the results. You could instead use the default RequestResponse pattern to start the jobs in your parallel state, then do whatever you need to after. You'd then need to include your own polling logic if you wanted the workflow to wait for completion of the jobs and return data on them or take action on completion. You can see an example of such polling for Glue Crawlers in this blog post (for which you can find sample code here).
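For the polling piece, the rough idea is: start the job runs with the plain RequestResponse integration (so each task returns immediately with the run ID), then loop a Wait/Choice over a status-check step until the runs finish. A minimal sketch of what that status check could look like as a Lambda handler using boto3; the event shape ("jobName"/"runId") is my own assumption for the example:

```python
# Sketch of a status-polling Lambda that a Step Functions workflow could call
# in a Wait/Choice loop after starting Glue jobs with the RequestResponse
# pattern. The event shape ("jobName"/"runId") is an assumption for this example.
import boto3

glue = boto3.client("glue")

def handler(event, context):
    response = glue.get_job_run(JobName=event["jobName"], RunId=event["runId"])
    state = response["JobRun"]["JobRunState"]  # e.g. RUNNING, SUCCEEDED, FAILED
    return {
        "jobName": event["jobName"],
        "runId": event["runId"],
        "state": state,
        "done": state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"),
    }
```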
I'm practicing for the Data Engineer GCP certification exam and got the following question:
You have a Google Cloud Dataflow streaming pipeline running with a Google Cloud Pub/Sub subscription as the source. You need to make an update to the code that will make the new Cloud Dataflow pipeline incompatible with the current version. You do not want to lose any data when making this update. What should you do?
Possible answers:
1. Update the current pipeline and use the drain flag.
2. Update the current pipeline and provide the transform mapping JSON object.
The correct answer according to the website is 1; my answer was 2. I'm not convinced my answer is incorrect, and these are my reasons:
Drain is a way to stop the pipeline and does not solve the incompatibility issues.
Mapping solves the incompatibility issue.
The only way that I see 1 as the correct answer is if you don't care about compatibility.
So which one is right?
I'm studying for the same exam, and the two core points of this question are:
1. Don't lose data ← Drain is perfect for this, because you process all buffered data and stop receiving new messages; normally a message is retained for up to 7 days of retries, so when you start a new job you will receive everything without losing any data.
2. Incompatible new code ← Mapping solves some incompatibilities, like changing the name of a ParDo, but not a version issue. So launching a new job with the new code is the only option.
So, the correct answer is option 1 (drain).
I think the main point is that you cannot solve all the incompatibilities with the transform mapping. Mapping can be done for simple pipeline changes (for example, names), but it doesn't generalize well.
The recommended solution is to drain the pipeline running the legacy version, as it will stop taking any data from the reading components, finish all the work pending on workers, and shut down.
When you start a new pipeline, you don't have to worry about state compatibility, as workers are starting fresh.
However, the question is indeed ambiguous; it should be more precise about the type of incompatibility, or at least state that it is asking in general. Arguably, you can always try to update the job with the mapping first: if Dataflow finds the new job to be incompatible, it will not affect the running pipeline, and then your only choice would be the drain option.
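For reference, draining a running job programmatically just means setting its requested state to JOB_STATE_DRAINED through the Dataflow API (the same thing gcloud dataflow jobs drain does). A rough sketch using the discovery-based Python client; the project, region, and job ID are placeholders:

```python
# Sketch: request a drain of a running Dataflow job via the v1b3 REST API,
# roughly equivalent to `gcloud dataflow jobs drain JOB_ID --region=REGION`.
# Project, region, and job ID below are placeholders.
from googleapiclient.discovery import build

dataflow = build("dataflow", "v1b3")
dataflow.projects().locations().jobs().update(
    projectId="my-project",
    location="us-central1",
    jobId="2023-01-01_00_00_00-1234567890",
    body={"requestedState": "JOB_STATE_DRAINED"},
).execute()
# Once the drain finishes, launch the new (incompatible) pipeline as a fresh job.
```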
I'd like to sketch out some steps in a SageMaker pipeline, and only fill them in one at a time, but I don't think there's an EmptyStep option anywhere.
I've considered using some vacuously true ConditionalSteps, or subclassing sagemaker.workflow.steps.Step, but the former can't be chained, and the latter seems likely to break things, given my implementation wouldn't necessarily conform to what the service is looking for.
Is there a good way to go about this? An empty processor step?
There's no way to create your own empty step in a SageMaker Pipeline. The easiest way to achieve this would be to use a LambdaStep and create stub Lambda functions. With a processor, you would pay the cold-start penalty for each job.
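A minimal sketch of what that could look like with the SageMaker Python SDK; the function name, role ARN, and script are placeholders you'd swap for real ones:

```python
# Sketch: a placeholder pipeline step backed by a stub Lambda function.
# Function name, role ARN, and script path are assumptions for this example.
from sagemaker.lambda_helper import Lambda
from sagemaker.workflow.lambda_step import LambdaStep
from sagemaker.workflow.pipeline import Pipeline

stub_lambda = Lambda(
    function_name="pipeline-noop-step",  # placeholder name
    execution_role_arn="arn:aws:iam::111122223333:role/lambda-role",  # placeholder
    script="noop.py",     # file containing `def handler(event, context): return {}`
    handler="noop.handler",
)

placeholder_step = LambdaStep(
    name="PlaceholderStep",
    lambda_func=stub_lambda,
    inputs={},
    outputs=[],
)

pipeline = Pipeline(name="sketch-pipeline", steps=[placeholder_step])
# Later, swap PlaceholderStep for a real ProcessingStep/TrainingStep and
# chain other steps off it via depends_on or its outputs.
```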
I work at AWS and my opinions are my own.
I'm trying to understand DVC; most tutorials mention generating dvc.yaml by running the dvc run command.
But at the same time, dvc.yaml, which defines the DAG, is also well documented. The fact that it is a YAML format and human-readable/writable suggests that it is meant to be a DSL for specifying your data pipeline.
Can somebody clarify which is the better practice?
Writing dvc.yaml by hand, or letting it be generated by the dvc run command?
Or is it left to the user's choice, with no technical difference?
I'd recommend manual editing as the main route! (I believe that's officially recommended since DVC 2.0)
dvc stage add can still be very helpful for programmatic generation of pipeline files, but it doesn't support all the features of dvc.yaml, for example setting vars values or defining foreach stages.
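For instance, if you did want a foreach stage generated programmatically, you could write dvc.yaml yourself from a script instead of going through dvc stage add. A small sketch (the stage name, items, and command are made up):

```python
# Sketch: generate a dvc.yaml containing a foreach stage programmatically.
# dvc stage add cannot express this, but writing the YAML directly can.
# Stage name, items, and command are made-up examples.
import yaml

pipeline = {
    "stages": {
        "train": {
            "foreach": ["model-a", "model-b", "model-c"],
            "do": {
                "cmd": "python train.py --model ${item}",
                "deps": ["train.py", "data/processed"],
                "outs": ["models/${item}"],
            },
        }
    }
}

with open("dvc.yaml", "w") as f:
    yaml.safe_dump(pipeline, f, sort_keys=False)
```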
Both, really.
Primarily, dvc run (or the newer dvc stage add followed by dvc exp run) is meant to manage your dvc.yaml file. For most (including casual) users, this is probably easiest and thus best. The format will be guaranteed to be correct (similar to choosing between {git,dvc} config and directly modifying .{git,dvc}/config).
However, as you note, dvc.yaml is human-readable. This is intentional, so that more advanced users can manually edit the YAML (potentially short-circuiting some validation checks, or unlocking advanced functionality such as foreach stages).
The way "fetch" materials works is that the latest "passed" build is transferred to the downstream pipelines.
Is it possible to do this even if the upstream stage fails?
I don't think that a stage failure even triggers the next stage or next pipeline, so nothing runs that could fetch the failed material.
Is it possible to do this even if the upstream stage fails?
No. It's not possible.
"Stages are meant to run sequentially". Why?
Mostly, you should model your problem using stages in such a way that they are dependent and sequential.
For example: "build > unit test > integration test > deploy".
If you look at the sequence above, it doesn't make sense to continue to the next step if the previous one fails. So in GoCD, stages are implemented to achieve this dependency pattern.
Your requirement might be valid, but stages might not be the right solution for that problem. I would suggest you rethink why you want to do that and use the correct abstraction in GoCD for it.
GoCD has pipelines, stages, jobs, and tasks. Check what best fits your situation and apply it.