I've just started using the AWS Ruby SDK to manage a simple workflow. One behavior I noticed right away is that at least one relevant worker and one relevant decider must be running before submitting a new workflow execution.
If I submit a new workflow execution before starting my worker and decider, then the tasks are never picked up, even when I'm still well within time-out limits. Why is this? Based on the description of how the HTTP long polling works, I would expect either app to receive the relevant tasks when the call to poll() is reached.
I encounter other deadlocking situations after a job fails (e.g. due to a worker or decider bug, or due to being terminated). Sometimes, re-running or even just starting an entirely new workflow execution will result in a deadlocked workflow execution. The initial decision tasks are shown in the workflow execution history in the AWS console, but the decider never receives them. Admittedly, I'm having trouble confirming/reducing this issue to a test case, but I suspect it is related to the above issue. This happens roughly 10 to 20% of the time; the rest of the time, everything works.
Some other things to mention: I'm using a single task list for two separate activity tasks that run in sequence. Both the worker and the decider are polling the same task list.
Here is my worker:
require 'yaml'
require 'aws'
config_file_path = File.join(File.dirname(File.expand_path(__FILE__)), 'config.yaml')
config = YAML::load_file(config_file_path)
swf = AWS::SimpleWorkflow.new(config)
domain = swf.domains['test-domain']
puts("waiting for an activity")
domain.activity_tasks.poll('hello-tasklist') do |activity_task|
  name = activity_task.activity_type.name
  puts name
  # report the result back to SWF so the decider can move the workflow forward
  activity_task.complete! :result => name
  puts("waiting for an activity")
end
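For reference, my decider follows the same pattern. The version below is a simplified sketch rather than my exact code: it schedules only a single activity, and the 'hello-activity' type name and version are placeholders.
require 'yaml'
require 'aws'
config_file_path = File.join(File.dirname(File.expand_path(__FILE__)), 'config.yaml')
config = YAML::load_file(config_file_path)
swf = AWS::SimpleWorkflow.new(config)
domain = swf.domains['test-domain']
puts("waiting for a decision")
domain.decision_tasks.poll('hello-tasklist') do |decision_task|
  decision_task.new_events.each do |event|
    case event.event_type
    when 'WorkflowExecutionStarted'
      # kick off the first activity on the shared task list
      decision_task.schedule_activity_task(
        domain.activity_types['hello-activity', '1'],
        :task_list => 'hello-tasklist')
    when 'ActivityTaskCompleted'
      # a real decider would schedule the second activity here; the sketch just finishes
      decision_task.complete_workflow_execution(:result => 'done')
    end
  end
  puts("waiting for a decision")
end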
EDIT
Another user on the AWS forums commented:
I think the cause is SWF not immediately recognizing a long-poll connection shutdown. When you kill a worker, its connection can be considered open by the service for some time, so the service can still dispatch a task to it; to you it looks like the new worker never gets it. The way to verify this is to check the workflow history: you'll see an ActivityTaskStarted event whose identity field contains the host and pid of the dead worker. Eventually such a task will time out and can be retried by the decider.
Note that this condition is common during unit tests, which frequently terminate connections, and is not really a problem for production applications. The common workaround is to use a different task list for each unit test.
This seems to be a pretty reasonable explanation. I'm going to try to confirm this.
You've raised two issues: one regarding start of an execution with no active deciders and the other regarding actors crashing in the middle of a task. Let me address them in order.
I have carried out an experiment based on your observations, and indeed, when a new workflow execution starts and no deciders are polling, SWF still records a new decision task as started. The following is my event log from the AWS console. Note what happens:
Fri Feb 22 22:15:38 GMT+000 2013 1 WorkflowExecutionStarted
Fri Feb 22 22:15:38 GMT+000 2013 2 DecisionTaskScheduled
Fri Feb 22 22:15:38 GMT+000 2013 3 DecisionTaskStarted
Fri Feb 22 22:20:39 GMT+000 2013 4 DecisionTaskTimedOut
Fri Feb 22 22:20:39 GMT+000 2013 5 DecisionTaskScheduled
Fri Feb 22 22:22:26 GMT+000 2013 6 DecisionTaskStarted
Fri Feb 22 22:22:27 GMT+000 2013 7 DecisionTaskCompleted
Fri Feb 22 22:22:27 GMT+000 2013 8 ActivityTaskScheduled
Fri Feb 22 22:22:29 GMT+000 2013 9 ActivityTaskStarted
Fri Feb 22 22:22:30 GMT+000 2013 10 ActivityTaskCompleted
...
The first decision task was immediately scheduled (which is expected) and started right away (i.e. allegedly dispatched to a decider, even though no decider was running). I started a decider in the meantime, but the workflow didn't move until the original decision task timed out, 5 minutes later. I can't think of a scenario where this would be the desired behavior. There are two possible defenses against this: have deciders running before starting a new execution, or set an acceptably low timeout on decision tasks (these tasks should complete almost immediately anyway).
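For the second defense, starting the execution would look roughly like this in the Ruby SDK (a sketch only, reusing the swf/domain setup from the question's worker; 'hello-workflow'/'1' stand in for your registered workflow type, and I believe :task_start_to_close_timeout is the v1 option that controls the decision task timeout):
workflow_type = domain.workflow_types['hello-workflow', '1']
workflow_type.start_execution(
  :task_list => 'hello-tasklist',
  :task_start_to_close_timeout => 30)  # an orphaned "started" decision task now expires after 30s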
The issue of a crashing actor (either a decider or a worker) is one that I'm familiar with. A short background note first:
Both activity and decision tasks are recorded by the service in three stages:
Scheduled = ready to be picked up by an actor.
Started = already picked up by an actor.
Completed/Failed/Timed out = the actor completed the task, failed it, or did not finish it within the deadline.
Once an actor has picked up a task and crashed, it is obviously not going to report anything back to the service (unless it can recover and still remembers the task token of the dispatched task - but most crashing actors aren't that smart). The next decision task will only be scheduled when the recently dispatched task times out, which is why all actors seem to be blocked for the duration of a task timeout. This is actually the desired behavior: the service can't know whether the task is being worked on or not as long as the actor is still within its deadline.
There is a simple way to deal with this: fit your actors with a try-catch block and fail the task when an unexpected crash happens. I would discourage using separate task lists for each integration test. Instead, I'd recommend failing the task in the teardown() block. SWF allows you to specify a reason for failing a task, which is one way of logging failures and viewing them later through the AWS console.
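In the Ruby worker from the question, that defense could look roughly like this (just a sketch: do_work is a placeholder for whatever the activity actually does, and it reuses the same swf/domain setup):
domain.activity_tasks.poll('hello-tasklist') do |activity_task|
  begin
    result = do_work(activity_task)  # placeholder for the real activity logic
    activity_task.complete!(:result => result)
  rescue => e
    # report the crash immediately instead of letting the task time out;
    # the reason and details show up in the execution history in the console
    activity_task.fail!(:reason => e.class.name, :details => e.message)
  end
end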
After requesting the service software release update in the AWS console, the following message appeared: "An update to release ******* has been requested and is pending. Before the update starts, you can cancel it any time."
So far I have waited for 1 day and it is still pending.
Any idea how long it takes? Do I need to do anything to move it from pending to updating, and should I expect any downtime during the update process?
I requested the R20210426-P2 update on a Monday and it was completed the next Saturday, so roughly 6 days from request to actual update. It's also worth noting that the update does not show up in the Upgrade tab in the UI; it shows up in the Notifications tab with this:
Service software update R20210426-P2 completed.
[UPDATE 11 Jul 2021] I just proceeded with updates on two additional domains and the updates began within 15 minutes.
[UPDATE 17 Dec 2021 Log4J CVE] I've had variable luck with R20211203-P2. One cluster updated in a few hours and one took a few days. A third one I was sure I had started a few days ago, but it gave me the option to update again today (possibly a timeout?). I'm guessing they limit the number of concurrent updates and things are backed up. I recommend continuing to check the console, but have patience; they do eventually get updated. If you have paid support, definitely open a ticket.
I have developed a Win32 service (SERVICE_WIN32_OWN_PROCESS) in C++ for Windows 10. It fails to start once in a while, with the following messages in the event log:
A timeout was reached (45000 milliseconds) while waiting for the MyService service to connect.
The MyService service failed to start due to the following error:
The service did not respond to the start or control request in a timely fashion.
What kind of timeout is happening here?
I know that when a service starts up, there is a timeout of 30 seconds from the start of the executable to the call of StartServiceCtrlDispatcher(). I have a log statement just before the call to StartServiceCtrlDispatcher(), but I do not see it. Unfortunately, I do not have any log statements at the point where the service starts up. In between startup and StartServiceCtrlDispatcher(), I have a bit of initialization, but nothing that I would expect to take 30 seconds to finish.
My service never reaches StartServiceCtrlDispatcher() and I have not seen traces in the event log that it crashes.
So, why does the error message mention a timeout of 45 seconds and not 30 seconds? What does this timeout represent?
Edit: For now I am mostly interested in whether other people have experienced a similar timeout and whether they have figured out the reason. I need to debug my code, but I hope that someone might be able to point me in a direction to concentrate my debugging on. Later I might need specific help with my code, once I know where to look :-)
Edit: Microsoft describes many kinds of timeouts in their API documentation for services, but I have not seen any mention of a 45-second timeout, even though I have read the documentation for all the API calls that I am using.
Note: I have not modified any timeouts in the system/registry, if such a thing is possible.
Edit:
Notes about my service.
The issue happens on a user's PC that I do not have direct access to.
My service starts up correctly most of the time; when it fails, it might be a Windows update running during boot-up that causes it.
In a virtual machine with a debug version of my service it takes less than 2 seconds from start of executable to call of StartServiceCtrlDispatcher(). That sounds reasonable. Far below 30 seconds and 45 seconds.
In my development environment, I have tried adding a delay (sleep) of more than 30 seconds between startup and StartServiceCtrlDispatcher(). This gave me the standard message about a 30000 millisecond timeout. Not 45000!
I have tried to force a crash between startup and StartServiceCtrlDispatcher(). This gave me an "Application Error" event log entry about the crash and the standard 30000 millisecond timeout. Not 45000! In the problem PC's event log, I have not noticed any "Application Error" entry when the startup failed.
After running for 17 hours, my Dataflow job failed with the following message:
The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures.
The 4 failures consist of 3 workers losing contact with the service, and one worker reported dead:
****-q15f Root cause: The worker lost contact with the service.
****-pq33 Root cause: The worker lost contact with the service.
****-fzdp Root cause: The worker ****-fzdp has been reported dead. Aborting lease 4624388267005979538.
****-nd4r Root cause: The worker lost contact with the service.
I don't see any errors in the worker logs for the job in Stackdriver. Is this just bad luck? I don't know how frequently work items need to be retried, so I don't know what the probability is that a single work item will fail 4 times over the course of a 24 hour job. But this same type of job failure happens frequently for this long-running job, so it seems like we need some way to either decrease the failure rate of work items, or increase the allowed number of retries. Is either possible? This doesn't seem related to my pipeline code, but in case it's relevant, I'm using the Python SDK with apache-beam==2.15.0. I'd appreciate any advice on how to debug this.
Update: The "STACK TRACES" section in the console is totally empty.
I was having the same problem and it was solved by scaling up my workers' resources. Specifically, I set --machine_type=n1-highcpu-96 in my pipeline configs. See this for a more extensive list of machine type options.
Edit: Set it to highcpu or highmem depending on the requirements of your pipeline process.
I have a simple scheduled task that sends an email as a test.
It's not working at all. Looking at the logs:
scheduler.log
Jul 8, 2016 1:20 PM Information scheduler-1
[test] Executing at Fri Jul 08 13:20:00 PDT 2016
It shows that the task has run, and I also think it's not running other tasks.
Looking at the application log I also see no errors.
Is there any other place I should be looking at?
The log you show above only indicates whether the scheduler ran when expected. It does not indicate whether the page you ran was successful.
To find out what happened with the page, go to the scheduled task editor and enable "Save output to a file".
Then specify a file name. Depending on the nature of the scheduled task, you may want to publish it to a shared directory or keep it hidden away.
Make sure to choose "Overwrite" so that you can always get the latest result of your scheduled task.
I have just done a clean install of CF8 on a Windows 2000 machine. I have a scheduled task I need to run every 15 minutes on this machine, and the machine does little else.
The task is set up as normal through CF admin, but for some reason, when the task takes about 5 minutes to run, it completes fine (I can see this from the debug output and from cfstat) yet the scheduler will not reschedule the task.
The scheduling log shows that the task started to execute, but there is no entry showing that it was rescheduled. E.g.:
[ProcessRecords] Executing at Wed May 20 10:30:00 BST 2009
I have been over my server timeouts. I have NO timeout in CF admin and this particular script has a <cfsetting requesttimeout="43200" /> tag set. There are no exceptions in the console logging. The last bit of console logging is the very last debug statement in my .cfm template.
I do notice that tasks that run in a shorter time, say under a minute, will reschedule as normal.
Has anyone come across a problem like this before?
I'm baffled. Any and all replies are appreciated!
Cheers,
Ciaran
Not for nothing, but I've never seen anything like this with CF8. Are you sure that you have the latest hotfix and JVM installed? This might have been something in CF8 that was fixed in 8.01.
hotfix 2 for cf8.01
list of all hotfixes and updates for cf8.01
hotfix 3 for cf8
list of all hotfixes and updates for cf8
latest jvm
upgrade instruction for jvm
If you suspect that an uncaught exception is causing the issue, then might I suggest logging portions of the process. Case in point: I had a similar problem with a scheduled task where it would just bottom out for no reason (never had the reschedule problem, though). What I ended up doing to diagnose the problem was to use cflog to write out portions of the process as they completed. This particular task took about 4 minutes to complete but ran through about 200 portions (it was a mass emailer for a bunch of clients).
I logged when each portion started and completed, along with how long it took. By doing so, I could see which portion would trip up the whole process and knew where to focus my attention.