Chef service restart_command not running on AWS OpsWorks instance - amazon-web-services

We are facing a strange issue with a restart_command in a Chef service definition not being executed. We have a task in AWS OpsWorks that is executed with the following service definition:
service "celery-worker-1" do
  action [:nothing]
  supports :restart => true, :status => true
  retries 3
  restart_command 'sv force-stop celery-worker-1 ; sv start celery-worker-1'
  if node[:opsworks][:instance][:layers].include?('celery-worker')
    subscribes :restart, "deploy_revision[testapp]", :delayed
  end
end
Then this is called at the end of the file via notifies:
elsif node[:opsworks][:instance][:layers].include?('celery-worker')
notifies :restart, resources(:service => "celery-worker-0", :service => "celery-worker-1")
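As a side note, the resources(...) call above builds a hash literal with the same key twice; Ruby keeps only the last duplicate key, so only one of the two services is ever looked up, and celery-worker-0 would never be notified through this call. A minimal illustration in plain Ruby:

```ruby
# Duplicate keys in a Ruby hash literal collapse silently: the last one wins.
# resources(:service => "a", :service => "b") therefore only ever sees "b".
h = { :service => "celery-worker-0", :service => "celery-worker-1" }
puts h.size       # => 1
puts h[:service]  # => "celery-worker-1"
```

Listing each service in its own notifies line avoids the problem.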
When this task is executed from OpsWorks, the logs show no errors or issues:
[2018-02-09T08:33:34+00:00] INFO: Processing service[celery-worker-0] action nothing (testapp::configure line 17)
[2018-02-09T08:33:34+00:00] INFO: Processing service[celery-worker-1] action nothing (testapp::configure line 27)
But when we check on the server itself, the celery workers were not restarted. Manually executing the command from restart_command on the server works without any issues. So it seems Chef is not executing this restart_command for some reason:
'sv force-stop celery-worker-1 ; sv start celery-worker-1'
Thanks in advance for the help.

That would mean that either node[:opsworks][:instance][:layers].include?('celery-worker') is false or deploy_revision[testapp] is not updating. For the latter, check the output: if you see something like deploy_revision[testapp] (up-to-date), then it isn't updating, so no notification is triggered. For the layers data, you would have to check manually.
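One way to sanity-check the first condition is to evaluate the same include? call against the instance JSON by hand. A small sketch in plain Ruby; the layer names here are invented for illustration, not taken from the question:

```ruby
require 'json'

# Hypothetical instance attributes, in the shape OpsWorks exposes them.
raw  = '{"opsworks": {"instance": {"layers": ["app-server"]}}}'
node = JSON.parse(raw)

# If the layer shortname doesn't match exactly, the subscribes guard is false
# and the restart notification is never wired up.
puts node["opsworks"]["instance"]["layers"].include?("celery-worker")  # => false
```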

Related

Scheduled Cloud build trigger throws 404 NOT_FOUND error

I recently created a scheduled trigger by following this Google page. But when I did a test run from the Scheduler's interface, the result was a NOT_FOUND error:
{
#type: "type.googleapis.com/google.cloud.scheduler.logging.AttemptFinished"
jobName: "projects/myproject/locations/australia-southeast1/jobs/trigger-schedule"
status: "NOT_FOUND"
targetType: "HTTP"
url: "https://cloudbuild.googleapis.com/v1/projects/myproject/triggers/ca55b01d-f4e6-4b8b-b92b-b2e4f380788c:run"
}
I was worried about the location, which is App Engine related; even though there are no instances, the location shows as australia-southeast1, which is correct.
What could be the cause of the error? Or even, what was not found: the job definition or the target?
After running gcloud beta builds triggers run TRIGGER, which is what the scheduled job runs, I found that cloudbuild.yaml does not exist in the targeted branch.
First, I wish the error in the scheduler could have been more meaningful and had some details.
Second, all triggers have conditions on how they are triggered. Maybe the HTTP POST call to the trigger can allow an empty body to use the default condition. In my case, the condition defined in the trigger was branch = test, while my scheduled job definition had branch = master. This mismatch caused the problem.
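That branch comparison can be sketched mechanically: take the branch the trigger's condition fires on and compare it with the branchName in the body the scheduler POSTs. The field name follows the Cloud Build RepoSource shape; the values below are illustrative:

```ruby
require 'json'

# Branch the Cloud Build trigger is configured to fire on (its condition).
trigger_branch = 'test'

# Body the Cloud Scheduler job POSTs to the trigger's :run endpoint.
scheduler_body = JSON.parse('{"branchName": "master"}')

# false here means the run request targets a branch the trigger won't build.
puts scheduler_body['branchName'] == trigger_branch  # => false
```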
Hope this could help others to debug scheduled triggers.

When I get 'services has reached steady state' in Amazon ECS, does it mean some tasks have stopped?

Does this mean that my service tasks are stopping, or is it OK to get these log messages?
Actually, it's the opposite. The service scheduler reports status periodically. A normal state indicates that there is nothing for it to do: all tasks are healthy, and there are no scaling requests or deployments.
No, it doesn't mean that any of your tasks have stopped. If a task stops, you will see an event that clearly states so and includes a link to the specific task that was stopped. For example, you will get something like this: "service xxx has stopped 1 running tasks: task xxx."
If no tasks have been created or stopped in the last six hours, the ECS console will duplicate the last event message to let you know that everything works as expected.
From the ECS docs:
"To ensure that this event view is helpful, we only show the 100 most recent events and duplicate event messages are omitted until either the cause is resolved or six hours passes. If the cause is not resolved within six hours, you will receive another service event message for that cause."
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-event-messages.html
Check this thread here on the aws forums. https://forums.aws.amazon.com/thread.jspa?threadID=182793
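For a programmatic check rather than eyeballing the console, the service's event list can be scanned for stop messages. A sketch over hand-written sample events (the shape mirrors ecs describe-services output; the messages are made up):

```ruby
require 'json'

# Hand-written sample events in the shape of `aws ecs describe-services` output.
events = JSON.parse(<<~JSON)
  [
    {"message": "(service web) has reached a steady state."},
    {"message": "(service web) has stopped 1 running tasks: (task 1234)."}
  ]
JSON

# "has reached a steady state" alone is benign; only "has stopped" events
# point at tasks that were actually stopped.
stopped = events.select { |e| e["message"].include?("has stopped") }
puts stopped.size  # => 1
```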
This sounds like normal behavior. The service scheduler reports status periodically. A normal state indicates that there is nothing for it to do: all tasks are healthy, and there are no scaling requests or deployments. Are you seeing any issues?

How to run docker task with Amazon ECS - getting error `STOPPED (CannotStartContainerError: Error response from dae)`

My goal is to execute a benchmark deployed as a docker image. While doing so, I had too many issues, so I decided to first make something extremely trivial work.
So I decided to follow the guide in https://docs.aws.amazon.com/AmazonECS/latest/developerguide/create-task-definition.html
and use the "ping" example; it should just ping a domain a couple of times, and then stop.
The problem is, I always receive this message in the task status:
STOPPED (CannotStartContainerError: Error response from dae)
I tried it with various subnets and security groups, but the result is always the same - the task starts, and after a minute or two fails with the message above.
I even tried it on a fresh new AWS account, using these steps:
in https://us-east-2.console.aws.amazon.com/ecs/ created new cluster (networking only)
in task definitions, created a taskdef
with docker image alpine:latest, command ping -c 4 google.com
then I select the cluster, switch to "tasks" tab, and enter the run dialog
with one of the pre-created subnets
After executing:
the task appears in the cluster's task list in the PENDING state
it takes a couple of minutes
eventually (using the refresh button), the status changes to the message mentioned above: STOPPED (CannotStartContainerError: Error response from dae)
My guess is that the reason is:
either the task cannot download the image
or the instance cannot reach the outside network
What can I be doing wrong? How to fix?
In my case, too, the log group was the problem. The one I had configured wasn't working, so I enabled the "Auto-configure CloudWatch Logs" option under "Log Configuration" in the container settings.
Also, if you open the stopped task and expand the container section, you can see a detailed error message under the Details section.
It could be a problem with the entry point, as pointed out in the comments on the question (in the task definition): Entrypoint: ["sh","-c"]
It could also be a bad reference, for example a wrong log group in the LogConfiguration or something similar.
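For reference, the awslogs settings live in the logConfiguration block of the container definition; if the named log group does not exist and auto-creation is not enabled, the container can fail to start with exactly this kind of truncated daemon error. A sketch with placeholder names:

```json
"logConfiguration": {
  "logDriver": "awslogs",
  "options": {
    "awslogs-group": "/ecs/ping-task",
    "awslogs-region": "us-east-2",
    "awslogs-stream-prefix": "ecs"
  }
}
```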
I just created the log group in my CloudWatch console, because it had not been created, and now everything is working well.

Cloud Run finishes but Cloud Scheduler thinks that job has failed

I have a Cloud Run service setup and I have a Cloud Scheduler task that calls an endpoint on that service. When the task completes (http handler returns), I'm seeing the following error:
The request failed because the HTTP connection to the instance had an error.
However, the actual handler returns HTTP 200 and successfully exits. Does anyone know what this error means and under what circumstances it shows up?
I'm also attaching a screenshot of the logs.
Does your job take longer than 120 seconds? I was having the same issue and figured out that Node versions prior to 13 have a 120-second server.timeout limit. I installed Node 13 in Docker and the problem is gone.
Error 503 is returned by the Google Frontend (GFE). The Cloud Run service either has a transient issue, or the GFE has determined that your service is not ready or not working correctly.
In your log entries, I see a POST request. 7 ms later is the error 503. This tells me your Cloud Run application is not yet ready (in a ready state determined by Cloud Run).
One minute and 8 seconds before that, I see ReplaceService. This tells me that your service was not yet in a running state and that, if you retry later, you will see success.
I've run an incremental sleep test on my FLASK endpoint which returns 200 within 1 min, 2 min and 10 min of waiting time. Having triggered the endpoint via the Cloud Scheduler, the job failed only in the 10 min test. I've found that it was one of the properties of my Cloud Scheduler job causing the failure. The following solved my issue.
gcloud scheduler jobs describe <my_test_scheduler>
There, you'll see a property called 'attemptDeadline' which was set to 180 seconds by default.
You can update that property using:
gcloud scheduler jobs update http <my_test_scheduler> --attempt-deadline 1000s
Ref: scheduler update

GoCD Custom Command

I am trying to run a very simple custom command, "echo helloworld", in GoCD as per the Getting Started Guide Part 2. However, the job does not finish: the console says "Waiting for console logs" and the raw output says "Console log for this job is unavailable as it may have been purged by Go or deleted externally."
My job looks like the following, which came from typing "echo" in the Lookup Command (this is different from the Getting Started example, which I tried first with the same result).
Judging from the screenshot, the problem seems to be that no agent is assigned to the task. For an agent to be assigned, it must satisfy all of these conditions:
An agent must be running, and connected to the server
The agent must be enabled on the "Agents" page
If you use environments, the job and the agent need to be in the same environment
The agent needs to have all of the resources assigned that are configured in the job
Found the issue.
The Pipelines have to be in the same Environment to work.