Concurrent Bamboo builds on different remote agents

I have two remote agents for builds. What is the full process for running the same build concurrently on those two agents using Bamboo?

First, change the plan's build requirements so that both agents are able to build this plan.
Then set the maximum number of concurrent builds to 2 or more.
After that, I think you can trigger the same plan twice and see the two builds running on the two agents.
This is not really recommended, though, because there is a slight chance that the agents will swap jobs between the builds.
For example, suppose Build 1 has 3 stages and starts first on agent #1, and Build 2 starts afterwards on agent #2. By the time Build 2 reaches stage 3, it might run on agent #1 or agent #2, depending on which agent is idle.

MWAA Airflow Scaling: what do I do when I have to run frequent & time consuming scripts? (Negsignal.SIGKILL)

I have an MWAA Airflow environment in my AWS account. The DAG I am setting up is supposed to read a large amount of data from S3 bucket A, filter out what I want, and dump the filtered results to S3 bucket B. It needs to run every minute because new data arrives every minute. Each run processes about 200 MB of JSON data.
My initial setup used environment class mw1.small with 10 worker machines. If I run the task only once in this setup, the run takes about 8 minutes to finish. But when I enable the schedule to run every minute, most runs cannot finish, they take much longer (around 18 minutes), and they show this error message:
[2021-09-25 20:33:16,472] {{local_task_job.py:102}} INFO - Task exited with return code Negsignal.SIGKILL
I tried upgrading the environment class to mw1.large with 15 workers. More jobs were able to complete before the error showed up, but the runs still could not keep up with data arriving every minute. The Negsignal.SIGKILL error still appeared even before the maximum number of worker machines was reached.
At this point, what should I do to scale this? I can imagine opening another Airflow env but that does not really make sense. There must be a way to do it within one env.
I've found the solution to this. For MWAA, edit the environment and, under Airflow configuration options, set these configs:
celery.sync_parallelism = 1
celery.worker_autoscale = 1,1
This makes sure each worker machine runs one job at a time. It prevents multiple jobs from sharing a worker, which saves memory and reduces runtime.
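If you prefer to apply these two options from a script instead of the console, a minimal sketch with boto3 might look like the following. The function name and the environment name in the usage note are hypothetical, and the sketch assumes no other Airflow configuration options on the environment need to be preserved, so check your existing options before running it.

```python
def apply_single_task_workers(mwaa_client, environment_name):
    """Set the two Celery options from the answer on an MWAA environment.

    `mwaa_client` is expected to be a boto3 MWAA client,
    e.g. boto3.client("mwaa").
    """
    options = {
        # One task at a time per worker, as described in the answer.
        "celery.sync_parallelism": "1",
        "celery.worker_autoscale": "1,1",
    }
    # Assumption: only these two options are wanted on the environment.
    return mwaa_client.update_environment(
        Name=environment_name,
        AirflowConfigurationOptions=options,
    )
```

Usage would be something like `apply_single_task_workers(boto3.client("mwaa"), "my-environment")`, where "my-environment" is a placeholder.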

Detect updates to AWS StepFunctions State Machine definition inside a Choice state

This is a really good pattern for restarting very long-running state machine executions based on an iteration count, so that we don't breach the Standard quotas of 1 year of execution time and 25k events: https://docs.aws.amazon.com/step-functions/latest/dg/tutorial-continue-new.html
My question: is it possible to detect, in a Choice state, whether the state machine definition has changed since the start of the execution? For example, in the IsCountReached state from that tutorial.
We are planning to handle state machine creation and updates using AWS CDK. This would let us completely automate deployments to state machines, instead of manually killing the execution and restarting it after changes to the state machine.
As far as I know there is no such thing. It does not really make sense either, since a state machine is run on a "version" of your state machine definition. When you change your definition (new version), you typically don't want running processes to be influenced by that, since that might have unexpected consequences.
That said, you should be able to build something like this fairly easily: build a Lambda function that finds the currently running executions of the state machine, stops them, and restarts them. You invoke this Lambda function as part of your deployment process if your definition changed.
This way, if your deployment contains changes to your state machine, all your currently running state machines would be restarted and then use the new definition.
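A minimal sketch of that stop-and-restart step, assuming a boto3 Step Functions client is passed in. The function name is hypothetical, and restarting with the original input is an assumption: any progress the old execution had made (such as an iteration count) is not carried over.

```python
def restart_running_executions(sfn, state_machine_arn):
    """Stop every RUNNING execution and start a new one with the same input.

    `sfn` is expected to be boto3.client("stepfunctions").
    Returns a list of (old_execution_arn, new_execution_arn) pairs.
    """
    # Collect all running executions first, so that the executions we
    # start below cannot show up in a later page of results.
    running = []
    kwargs = {"stateMachineArn": state_machine_arn, "statusFilter": "RUNNING"}
    while True:
        page = sfn.list_executions(**kwargs)
        running.extend(item["executionArn"] for item in page["executions"])
        token = page.get("nextToken")
        if not token:
            break
        kwargs["nextToken"] = token

    restarted = []
    for arn in running:
        # Assumption: the original input is enough to start over.
        original_input = sfn.describe_execution(executionArn=arn).get("input", "{}")
        sfn.stop_execution(
            executionArn=arn,
            error="Redeployed",
            cause="State machine definition changed",
        )
        started = sfn.start_execution(
            stateMachineArn=state_machine_arn, input=original_input
        )
        restarted.append((arn, started["executionArn"]))
    return restarted
```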
DescribeStateMachine doesn't return updateDate but DescribeStateMachineForExecution returns it:
https://docs.aws.amazon.com/step-functions/latest/apireference/API_DescribeStateMachineForExecution.html
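One way this API could be used for the original question is a small Lambda task placed before the Choice state, which compares the definition the execution is running on with the current one. The sketch below is an assumption about how to wire this together, not a built-in Step Functions feature; the function name is hypothetical, and it compares the `definition` strings returned by both calls rather than dates.

```python
def definition_changed(sfn, execution_arn):
    """Return True if the state machine was updated after this execution started.

    `sfn` is expected to be boto3.client("stepfunctions").
    """
    # Definition that this particular execution is running on.
    running = sfn.describe_state_machine_for_execution(executionArn=execution_arn)
    # Definition currently deployed for the same state machine.
    current = sfn.describe_state_machine(stateMachineArn=running["stateMachineArn"])
    return running["definition"] != current["definition"]
```

The execution ARN is available inside the state machine through the context object as `$$.Execution.Id`, so the task could pass it to the Lambda function, and the Choice state could then branch on the boolean result.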

Long-running Dataflow job fails with no errors in user code

After running for 17 hours, my Dataflow job failed with the following message:
The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures.
The 4 failures consist of 3 workers losing contact with the service, and one worker reported dead:
****-q15f Root cause: The worker lost contact with the service.
****-pq33 Root cause: The worker lost contact with the service.
****-fzdp Root cause: The worker ****-fzdp has been reported dead. Aborting lease 4624388267005979538.
****-nd4r Root cause: The worker lost contact with the service.
I don't see any errors in the worker logs for the job in Stackdriver. Is this just bad luck? I don't know how frequently work items need to be retried, so I don't know what the probability is that a single work item will fail 4 times over the course of a 24 hour job. But this same type of job failure happens frequently for this long-running job, so it seems like we need some way to either decrease the failure rate of work items, or increase the allowed number of retries. Is either possible? This doesn't seem related to my pipeline code, but in case it's relevant, I'm using the Python SDK with apache-beam==2.15.0. I'd appreciate any advice on how to debug this.
Update: The "STACK TRACES" section in the console is totally empty.
I was having the same problem, and it was solved by scaling up my workers' resources. Specifically, I set --machine_type=n1-highcpu-96 in my pipeline configs. See the list of Compute Engine machine types for more options.
Edit: Set it to highcpu or highmem depending on the requirements of your pipeline process.
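As a rough sketch, the flag can be assembled in Python and handed to the Beam pipeline options. The helper name below is hypothetical, and the default simply mirrors the machine type mentioned above; pick highcpu or highmem according to your own pipeline.

```python
def dataflow_worker_args(machine_type="n1-highcpu-96"):
    """Build command-line style flags for a Dataflow job with larger workers."""
    return [
        "--runner=DataflowRunner",
        # Same flag as in the answer; the value is a Compute Engine machine type.
        f"--machine_type={machine_type}",
    ]
```

These flags would then be passed along the lines of `PipelineOptions(dataflow_worker_args("n1-highmem-96"))`, with `PipelineOptions` imported from `apache_beam.options.pipeline_options`, together with your usual project, region, and staging flags.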

AWS Pipeline, wait for stage to complete

I have an AWS pipeline that:
1) In the first stage, gets template.yaml and builds an EC2 Windows instance via a script.
Note: when this machine boots up, its user data starts a script that downloads requirements (git etc.) and the code, sets up IIS, and does various other things.
This happens once the CloudFormation part has completed, and takes about another 5 minutes.
2) In the second stage, I then want to run external tests on this machine, maybe using BlazeMeter.
The problem is that between stages 1 and 2 I need to wait for the website to work on the box, so I need to wait at least 5 minutes. I could add a manual approval stage, but this seems cumbersome.
Does anyone have a way to add this timed wait, or a pipeline process to check that the site is up?

How is it that a mapreduce pipeline can run longer than 10 minutes?

MapReduce tasks are run within a parent pipeline, and of course we all know they can run for a very long time. But at the same time, the pipeline api documents that a pipeline must complete within 10 minutes (https://github.com/GoogleCloudPlatform/appengine-pipelines/wiki/Python). What is the proper way to understand this?
Thanks.
That pipeline documentation is really old... when it was written, tasks were limited to 10 minutes. Now you can configure a non-default module (these used to be called "backends") with basic or manual scaling, which allows a task to run for 24 hours:
https://cloud.google.com/appengine/docs/python/modules/#Python_Instance_scaling_and_class
(NOTE: if you run a task on an auto-scaled module, it will still be limited to 10 minutes.)
The entire pipeline doesn't have to be limited to 24hrs though. The "root" pipeline (the first task that runs) can yield many child pipelines, and those each can further yield other pipelines... each pipeline is a task that has to run within the allotted time (10mins or 24hrs)... when it is done, it signals the parent to wake-up and finish... so the overall pipeline could run for days or months or whatever
We have our app split into two modules, one for the front-end (default, auto-scaled) that handles web requests, and one for the "back end" (basic scaling) that runs all of our tasks