AWS Pipeline, wait for stage to complete

I have an AWS pipeline which:
1) in the first stage, gets template.yaml and builds an EC2 Windows instance via a script.
Note that when this machine boots up, its user data starts a script that downloads requirements (git, the code, etc.), sets up IIS and various other things.
This happens once the CloudFormation part has completed, and takes about another 5 minutes.
2) I then want to run external tests on this machine - maybe using BlazeMeter - as the second part of the pipeline.
The problem is that between stage 1 and stage 2 I need to wait for the website to work on the box, so I need to wait at least 5 minutes. I could add a manual approval stage, but this seems cumbersome.
Does anyone have a way to add this timed wait, or a pipeline process to check that the site is up?
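One option (not from this thread, just a sketch) is to add an intermediate action that polls the site until it responds and fails after a deadline. A minimal Python sketch, where the URL and the timings are placeholder assumptions:

```python
# Sketch: poll the new instance's site until it responds, or give up.
# SITE_URL and the timings are assumptions; adjust for your stack.
import sys
import time
import urllib.request

SITE_URL = "http://my-new-instance.example.com/health"  # hypothetical endpoint
DEADLINE_SECONDS = 15 * 60        # give the user-data script up to 15 minutes
POLL_INTERVAL_SECONDS = 30

start = time.time()
while time.time() - start < DEADLINE_SECONDS:
    try:
        with urllib.request.urlopen(SITE_URL, timeout=10) as resp:
            if resp.status == 200:
                print("Site is up")
                sys.exit(0)
    except Exception as exc:
        print(f"Not ready yet: {exc}")
    time.sleep(POLL_INTERVAL_SECONDS)

print("Site never came up within the deadline")
sys.exit(1)
```

If this runs as a CodeBuild action between the two stages, a non-zero exit code fails that action, so the test stage only starts once the site actually responds.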

Related

MWAA Airflow Scaling: what do I do when I have to run frequent & time-consuming scripts? (Negsignal.SIGKILL)

I have an MWAA Airflow env in my AWS account. The DAG I am setting up is supposed to read massive data from S3 bucket A, filter what I want and dump the filtered results to S3 bucket B. It needs to read every minute since the data is coming in every minute. Every run processes about 200MB of json data.
My initial setup used environment class mw1.small with 10 worker machines. If I only run the task once with this setting, each run takes about 8 minutes to finish, but when I start the schedule to run every minute, most runs cannot finish, start to take much longer (around 18 mins) and display the error message:
[2021-09-25 20:33:16,472] {{local_task_job.py:102}} INFO - Task exited with return code Negsignal.SIGKILL
I tried expanding the environment class to mw1.large with 15 workers; more jobs were able to complete before the error showed up, but it still could not keep up with the pace of ingesting every minute. The Negsignal.SIGKILL error would still appear before even reaching the worker machine maximum.
At this point, what should I do to scale this? I can imagine opening another Airflow env but that does not really make sense. There must be a way to do it within one env.
I've found the solution to this. For MWAA, edit the environment and, under Airflow configuration options, set these configs:
celery.sync_parallelism = 1
celery.worker_autoscale = 1,1
This will make sure each worker machine runs 1 job at a time, preventing multiple jobs from sharing the worker, which saves memory and reduces runtime.
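If you prefer to script the change instead of editing the environment in the console, a minimal sketch using boto3's MWAA client (the environment name is a placeholder):

```python
import boto3

mwaa = boto3.client("mwaa")

# Apply the same two Airflow configuration options shown above.
mwaa.update_environment(
    Name="my-mwaa-env",  # hypothetical environment name
    AirflowConfigurationOptions={
        "celery.sync_parallelism": "1",
        "celery.worker_autoscale": "1,1",
    },
)
```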

AWS Beanstalk: often restarts the service automatically

I have a super small C# .NET Framework web application that I run on AWS Beanstalk. It works fine, except for one peculiar thing:
AWS seems to restart the service at irregular intervals. I can see this because log prints that should only happen on startup keep appearing.
In this case I am using ServiceStack AppHostBase, which has a Configure method that should be executed only once, on startup. In this method a log print is done, and it gets executed roughly every 15 minutes to 2 hours, indicating that the service is restarted at those times:
No update of the build was deployed at the times seen above.
The Beanstalk configuration looks like this:
I would like to know why it restarts and what I can do to stop it.

How is it that a mapreduce pipeline can run longer than 10 minutes?

MapReduce tasks are run within a parent pipeline, and of course we all know they can run for a very long time. But at the same time, the pipeline API documentation says that a pipeline must complete within 10 minutes (https://github.com/GoogleCloudPlatform/appengine-pipelines/wiki/Python). What is the proper way to understand this?
Thanks.
That pipeline documentation is really old... when it was written, tasks were limited to 10 minutes. Now you can configure a non-default module (used to be called a "backend") with basic/manual scaling that will allow a task to run for 24 hours:
https://cloud.google.com/appengine/docs/python/modules/#Python_Instance_scaling_and_class
(NOTE: if you run a task on an auto-scaled module, it will still be limited to 10 minutes)
The entire pipeline doesn't have to be limited to 24 hours though. The "root" pipeline (the first task that runs) can yield many child pipelines, and each of those can further yield other pipelines... each pipeline is a task that has to run within the allotted time (10 minutes or 24 hours)... when it is done, it signals the parent to wake up and finish... so the overall pipeline could run for days or months or whatever.
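As a rough illustration of that fan-out (a sketch against the appengine-pipelines API, assuming it is importable as `pipeline`; the class names and sharding are made up):

```python
import pipeline  # appengine-pipelines


class ProcessShard(pipeline.Pipeline):
    """Child pipeline: one task, must finish within its module's time limit."""

    def run(self, shard_id):
        # ... do a bounded chunk of work ...
        return shard_id


class RootPipeline(pipeline.Pipeline):
    """Root pipeline: yields children and only completes when they all do."""

    def run(self, num_shards):
        for shard_id in range(num_shards):
            # Each yield schedules a separate task; the root therefore
            # outlives any single task's 10-minute/24-hour limit.
            yield ProcessShard(shard_id)
```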
We have our app split into two modules, one for the front-end (default, auto-scaled) that handles web requests, and one for the "back end" (basic scaling) that runs all of our tasks

AWS Elastic Beanstalk Worker timing out after inactivity during long computation

I am trying to use Amazon Elastic Beanstalk to run a very long numerical simulation - up to 20 hours. The code works beautifully when I tell it to do a short, 20 second simulation. However, when running a longer one, I get the error "The following instances have not responded in the allowed command timeout time (they might still finish eventually on their own)".
After browsing the web, it seems to me that the issue is that Elastic Beanstalk allows worker processes to run for 30 minutes at most, and then they time out because the instance has not responded (i.e. finished the simulation). The solution some have proposed is to send a message every 30 seconds or so that "pings" Elastic Beanstalk, letting it know that the simulation is going well so it doesn't time out, which would let me run a long worker process. So I have a few questions:
Is this the correct approach?
If so, what code or configuration would I add to the project to make it stop terminating early?
If not, how can I smoothly run a 12+ hour simulation on AWS or more generally, the cloud?
Add on information
Thank you for the feedback, Rohit. To give some more information, I'm using Python with Flask.
• I am indeed using an Elastic Beanstalk worker tier with SQS queues
• In my code, I'm running a simulation of variable length - from as short as 20 seconds to as long as 20 hours. 99% of the work that Elastic Beanstalk does is running the simulation. The other 1% involves saving results, sending emails, etc.
• The simulation itself involves generating many random numbers and working with objects that I defined. I use numpy heavily here.
Let me know if I can provide any more information. I really appreciate the help :)
After talking to a friend who's more in the know about this stuff than me, I solved the problem. It's a little sketchy, but got the job done. For future reference, here is an outline of what I did:
1) Wrote a main script that used Amazon's boto library to connect to my SQS queue, with an infinite while loop that polls the queue every 60 seconds. When there's a message on the queue, it runs a simulation and then continues with the loop (see the sketch after this list).
2) Borrowed a beautiful /etc/init.d/ template to run my script as a daemon (http://blog.scphillips.com/2013/07/getting-a-python-script-to-run-in-the-background-as-a-service-on-boot/)
3) Made my main script and the script in (2) executable
4) Set up a cron job to make sure the script would start back up if it failed.
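A minimal sketch of step (1), using boto3 rather than the original boto library; the queue URL and the simulation function are placeholders:

```python
import time

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/simulation-queue"  # hypothetical


def run_simulation(message_body):
    # Placeholder for the long-running numerical simulation.
    pass


while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,  # long polling
    )
    for msg in resp.get("Messages", []):
        run_simulation(msg["Body"])
        # Only delete once the simulation has finished successfully.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
    time.sleep(60)
```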
Once again, thank you Rohit for taking the time to help me out. I'm glad I still got to use Amazon even though Elastic Beanstalk wasn't the right tool for the job
From your question it seems you are running into launches timing out because some commands that run on your instance during launch take more than 30 minutes.
As explained here, you can adjust the Timeout option in the aws:elasticbeanstalk:command namespace. This can have values between 1 and 1800. This means if your commands finish within 30 minutes you won't see this error. The commands might eventually finish as the error message says but since Elastic Beanstalk has not received a response within the specified period it does not know what is going on your instance.
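For completeness, a sketch of setting that option programmatically with boto3 (an .ebextensions config file is the more usual route; the environment name here is a placeholder):

```python
import boto3

eb = boto3.client("elasticbeanstalk")

# Raise the deployment command timeout to its maximum of 1800 seconds.
eb.update_environment(
    EnvironmentName="my-worker-env",  # hypothetical environment name
    OptionSettings=[
        {
            "Namespace": "aws:elasticbeanstalk:command",
            "OptionName": "Timeout",
            "Value": "1800",
        }
    ],
)
```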
It would be helpful if you could add more details about your use case. What commands are you running during startup? Apparently you are using ebextensions to launch commands which take a long time. Is it possible to run those commands in the background, or do you need them to run during server startup?
If you are running a Tomcat web app you could also use something like servlet init method to run app bootstrapping code. This code can take however long it needs without giving you this error message.
Unfortunately, there is no way to 'process a message' from an SQS queue for more than 12 hours (see the description of ChangeVisibilityTimeout).
With that being the case, this approach doesn't fit your application well. I have run into the same problem.
The correct way to do this: I don't know. However, I would suggest an alternate approach where you grab a message off of your queue, spin off a thread or process to run your long running simulation, and then delete the message (signaling successful processing). In this approach, be careful of spinning off too many threads on one machine and also be wary of machines shutting down before the simulation has ended, because the queue message has already been deleted.
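A rough sketch of that alternative (delete the message up front, then run the work in a separate process; the queue URL and function are placeholders):

```python
import multiprocessing

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/simulation-queue"  # hypothetical


def run_simulation(message_body):
    # Placeholder for the long-running simulation.
    pass


resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
for msg in resp.get("Messages", []):
    # Delete immediately so the visibility timeout no longer matters;
    # the trade-off is that a crash mid-simulation loses the job.
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
    multiprocessing.Process(target=run_simulation, args=(msg["Body"],)).start()
```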
Final note: your question is excellently worded and sufficiently detailed :)
For those looking to run jobs shorter than 10 hours, it should be mentioned that the current inactivity timeout limit is 36,000 seconds, i.e. exactly 10 hours, and no longer 30 minutes as stated in posts all over the web (which led me to think a workaround like the one described above was needed).
Check out the docs: https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features-managing-env-tiers.html
A very nice write-up can be found here: https://dev.to/rizasaputra/understanding-aws-elastic-beanstalk-worker-timeout-42hi

Hello World pipeline with ShellCommandActivity

I'm trying to create a simple Data Pipeline with a single activity of type ShellCommandActivity. I've attached the configuration of the activity and the EC2 resource.
When I execute this, the Ec2Resource sits in the WAITING_ON_DEPENDENCIES state and after some time changes to TIMEDOUT. The ShellCommandActivity is always in the CANCELED state. I see the instance launch and very quickly change to the terminated state.
I've specified a s3 log file url, but that never gets updated.
Can anyone give me any pointers? Also is there any guidance out there on debugging this?
Thanks!!
You are currently forcing your instance to shut down after 1 minute, which gives the TIMEDOUT status if the activity can't execute in that time. Try increasing it to 50 minutes.
Also make sure you are using an AMI that runs Amazon Linux and that you are using full absolute paths in your scripts.
S3 log files are written as:
s3://bucket/folder/
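A sketch of what raising that limit might look like when pushing a definition with boto3's Data Pipeline client (the pipeline ID, object names, and bucket are placeholders, and a full definition also needs the activity, schedule, and default objects):

```python
import boto3

dp = boto3.client("datapipeline")

# Ec2Resource object with a longer terminateAfter, as suggested above.
ec2_resource = {
    "id": "Ec2ResourceObj",
    "name": "Ec2ResourceObj",
    "fields": [
        {"key": "type", "stringValue": "Ec2Resource"},
        {"key": "terminateAfter", "stringValue": "50 Minutes"},
        {"key": "logUri", "stringValue": "s3://my-bucket/logs/"},  # hypothetical bucket
    ],
}

dp.put_pipeline_definition(
    pipelineId="df-0123456789ABCDEFGH",  # hypothetical pipeline ID
    pipelineObjects=[ec2_resource],      # plus your ShellCommandActivity, schedule, etc.
)
```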