I started with Amazon's Java-based HelloWorldWorkflowDistributed example and I'm adding to it little by little to achieve what we want. I have added a second activity worker, but the two workers are receiving each other's tasks and no tasks are getting accomplished. Can anyone point me to a COMPLETE, WORKING example of a workflow that calls out to two or more distinct workers?
For example, the following error appears in the console where BarActivities.getName is running, and vice versa:
Aug 26, 2016 2:15:24 PM com.amazonaws.services.simpleworkflow.flow.worker.SynchronousActivityTaskPoller execute
SEVERE: Failure processing activity task with taskId=10, workflowGenerationId=id_for_107, activity={Name: FooActivities.getAddress,Version: 1.0.7}, activityInstanceId=1
com.amazonaws.services.simpleworkflow.flow.ActivityFailureException: Unknown activity type: {Name: FooActivities.getAddress,Version: 1.0.7} : null
at com.amazonaws.services.simpleworkflow.flow.worker.SynchronousActivityTaskPoller.execute(SynchronousActivityTaskPoller.java:194)
at com.amazonaws.services.simpleworkflow.flow.worker.ActivityTaskPoller$2.run(ActivityTaskPoller.java:92)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Activity workers poll for activity tasks using task lists. I believe you added a new worker without using a separate task list for its activities. Because both workers share the same task list, they sometimes receive tasks for activities they don't support, which results in the "Unknown activity type" exception. The solution is to use a different task list for each worker.
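For illustration, here is a minimal sketch of the idea using boto3 rather than the Flow Framework (the domain, task list, and type names are placeholders): each worker polls only its own task list, so it never sees the other worker's activity types. In the Flow Framework the equivalent is giving each ActivityWorker its own task list so the activity types it registers default to that list.

# Conceptual sketch with boto3; the Flow Framework does the equivalent polling internally.
import boto3

swf = boto3.client("swf")

def run_worker(domain, task_list, supported_types):
    # Poll one task list; only tasks routed to this list are ever received.
    while True:
        task = swf.poll_for_activity_task(
            domain=domain,
            taskList={"name": task_list},
            identity="worker-" + task_list,
        )
        if not task.get("taskToken"):
            continue  # long poll timed out, poll again
        assert task["activityType"]["name"] in supported_types
        # ... run the activity, then report the result ...
        swf.respond_activity_task_completed(taskToken=task["taskToken"], result="done")

# One process per worker, each with its own task list:
# run_worker("MyDomain", "FooActivitiesTaskList", {"FooActivities.getAddress"})
# run_worker("MyDomain", "BarActivitiesTaskList", {"BarActivities.getName"})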
I am seeing something similar to this post. It looked like additional detail was needed to answer that question, and those details weren't provided, so I'm re-asking with my own details.
I am running a modified version of the Google Cloud Run image processing tutorial example.
I am inserting tasks into a task queue using this create tasks snippet. The tasks from the queue get pushed to my Cloud Run instance.
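For context, the create-tasks code boils down to roughly the following (a simplified sketch; the project, location, queue, and service URL are placeholders, not my actual values):

from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
parent = client.queue_path("my-project", "us-central1", "my-queue")

def enqueue(payload: bytes):
    # Each task is an HTTP POST that Cloud Tasks pushes to the Cloud Run service.
    task = {
        "http_request": {
            "http_method": tasks_v2.HttpMethod.POST,
            "url": "https://my-service-xyz-uc.a.run.app/process",
            "body": payload,
        }
    }
    return client.create_task(request={"parent": parent, "task": task})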
The problem is that it isn't scaling up and working through my tasks in a timely manner.
My Cloud Run service configuration:
I have tried setting a minimum of both 0 and 50 instances
I have tried a maximum of 100 and 1000 instances
I have tried --concurrency=1, 2, and 8
I have tried with --async and without --async
With 50 instances pre-allocated, even with concurrency set to 1, I typically see ~10 active container instances and ~40 idle container instances. I have ~30,000 tasks in the queue and it is getting through ~5 jobs/minute.
My Cloud Tasks queue has the default settings. My containers aren't using a lot of CPU, but they are using a lot of memory.
A process takes about a minute to complete. I'm only running one process per container instance. What additional parameters should be set to get higher throughput?
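For reference, the queue settings I'm calling "default" can be read back like this (a sketch; the project, location, and queue names are placeholders):

from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
queue = client.get_queue(name=client.queue_path("my-project", "us-central1", "my-queue"))

# Dispatch rate, number of tasks allowed in flight at once, and retry behaviour.
print(queue.rate_limits.max_dispatches_per_second)
print(queue.rate_limits.max_concurrent_dispatches)
print(queue.retry_config.max_attempts)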
Edit - adding additional logs
I enabled logging for the queue and I'm seeing errors for some of the jobs. The errors look like this:
{
insertId: "<my_id>"
jsonPayload: {
#type: "type.googleapis.com/google.cloud.tasks.logging.v1.TaskActivityLog"
attemptResponseLog: {
attemptDuration: "19.453155s"
dispatchCount: "1"
maxAttempts: 0
responseCount: "0"
retryTime: "2021-10-20T22:45:51.559121Z"
scheduleTime: "2021-10-20T16:42:20.848145Z"
status: "UNAVAILABLE"
targetAddress: "POST <my_url>"
targetType: "HTTP"
}
task: "<my_task>"
}
logName: "<my_log_name>"
receiveTimestamp: "2021-10-20T22:45:52.418715942Z"
resource: {
labels: {
location: "us-central1"
project_id: "<my_project>"
queue_id: "<my-queue>"
target_type: "HTTP"
}
type: "cloud_tasks_queue"
}
severity: "ERROR"
timestamp: "2021-10-20T22:45:51.459232147Z"
}
I don't see errors in the Cloud Run logs.
Edit - Additional Debug Information
I tried to take the queue out of the equation to determine whether the problem is Cloud Run or the queue. Instead, I used curl to POST directly to the URL. Some of the tasks ran successfully; for others I received an error. In the logs below, the empty lines are the successful runs:
upstream connect error or disconnect/reset before headers. reset reason: connection termination
upstream connect error or disconnect/reset before headers. reset reason: connection termination
upstream connect error or disconnect/reset before headers. reset reason: connection termination
upstream connect error or disconnect/reset before headers. reset reason: connection termination
upstream connect error or disconnect/reset before headers. reset reason: connection termination
This makes me think Cloud Run isn't handling all of the incoming requests.
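For reference, the test was essentially this loop (a sketch using Python requests in place of curl; the URL and payloads are placeholders):

import requests

url = "https://my-service-xyz-uc.a.run.app/process"
payloads = [b"task-1", b"task-2", b"task-3"]  # placeholder bodies; the real ones came from the queue

for payload in payloads:
    resp = requests.post(url, data=payload, timeout=600)
    # An empty body means success; the failures return the
    # "upstream connect error ..." text shown above.
    print(resp.status_code, resp.text)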
Edit - task completion time test
I wanted to test whether the time it takes to complete a task causes any issues with Cloud Run and the queue scaling up and keeping up with the tasks.
In place of the task I actually want completed, I put in a dummy task that just sleeps for n seconds and prints the task details to stdout (which I can read in the Cloud Run logs).
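The dummy handler is essentially this (a sketch; the route name and the environment variable used to pass n are just illustrative):

import os
import time

from flask import Flask, request

app = Flask(__name__)
SLEEP_SECONDS = int(os.environ.get("SLEEP_SECONDS", "10"))  # the "n" in the test

@app.route("/process", methods=["POST"])
def process():
    # Log the task body so it shows up in the Cloud Run logs, then pretend to work.
    print("task payload:", request.get_data(as_text=True), flush=True)
    time.sleep(SLEEP_SECONDS)
    return ("", 204)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", "8080")))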
With n set to 0, 5, or 10 seconds, I see the number of instances scale up and keep up with the tasks being added to the queue. With n set to 20 seconds or more, I see that fewer Cloud Run instances are instantiated and items accumulate in the task queue, and I see more errors with the UNAVAILABLE status in my logs.
According to this post:
Cloud Run offers a longer request timeout duration of up to 60 minutes
So it seems that long-running tasks are expected to be supported. Is this a Google bug, or am I missing some parameter?
I do not think this is a Cloud Run service problem. I think this is an issue with how you have Tasks set up.
The dates in the log entry look odd. Take a look at the receiveTimestamp and the scheduleTime. The task is scheduled for six hours before the receive time. Do you have a timezone problem?
According to the documentation, if the response_time is not set then the task was not attempted. It looks like you are scheduling tasks incorrectly and the tasks never run.
Search for the text "The status of a task attempt." at this link:
Types for Google Cloud Tasks
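To illustrate the scheduling point (a sketch, not the asker's code): if you set schedule_time at all, build it from a timezone-aware UTC datetime, otherwise a local time can land hours in the future and the queue simply sits on the task. The project, queue, and URL values here are placeholders.

import datetime

from google.cloud import tasks_v2
from google.protobuf import timestamp_pb2

client = tasks_v2.CloudTasksClient()
parent = client.queue_path("my-project", "us-central1", "my-queue")

run_at = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(seconds=30)
schedule_time = timestamp_pb2.Timestamp()
schedule_time.FromDatetime(run_at)  # explicit UTC, no local-timezone surprises

task = {
    "http_request": {
        "http_method": tasks_v2.HttpMethod.POST,
        "url": "https://my-service-xyz-uc.a.run.app/process",
    },
    "schedule_time": schedule_time,  # omit entirely to dispatch as soon as possible
}
client.create_task(request={"parent": parent, "task": task})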
We're facing some very strange behavior in Airflow.
Some subdags get stuck in the running state (even though their internal tasks have finished successfully).
For example, I have the subdag load-folder-to-layer that should have started and ended on 2020-11-18, but it stayed stuck until the 20th.
I looked at the task_instances and job tables and could see that the task's job was executing and receiving heartbeats:
The last log message in the subdag is:
[2020-11-18 09:22:01,879] {taskinstance.py:664} INFO - Dependencies not met for <TaskInstance: bietlejuice.docx.load-folder-reference-to-clean 2020-11-17T03:00:00+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
Also, I noticed a node drop on that exact same day, the 18th (image):
This leads me to think that the scheduler hits a bug when reassigning this task (subdag) to another worker, which leaves the task stuck.
Does anyone have a clue about this?
After running for 17 hours, my Dataflow job failed with the following message:
The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures.
The 4 failures consist of 3 workers losing contact with the service, and one worker reported dead:
****-q15f Root cause: The worker lost contact with the service.
****-pq33 Root cause: The worker lost contact with the service.
****-fzdp Root cause: The worker ****-fzdp has been reported dead. Aborting lease 4624388267005979538.
****-nd4r Root cause: The worker lost contact with the service.
I don't see any errors in the worker logs for the job in Stackdriver. Is this just bad luck? I don't know how frequently work items need to be retried, so I don't know what the probability is that a single work item will fail 4 times over the course of a 24 hour job. But this same type of job failure happens frequently for this long-running job, so it seems like we need some way to either decrease the failure rate of work items, or increase the allowed number of retries. Is either possible? This doesn't seem related to my pipeline code, but in case it's relevant, I'm using the Python SDK with apache-beam==2.15.0. I'd appreciate any advice on how to debug this.
Update: The "STACK TRACES" section in the console is totally empty.
I was having the same problem and it was solved by scaling up my workers' resources. Specifically, I set --machine_type=n1-highcpu-96 in my pipeline configs. See this for a more extensive list of machine type options.
Edit: Set it to highcpu or highmem depending on the requirements of your pipeline process
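If you set options from Python rather than on the command line, this is roughly the equivalent (a sketch; the project, region, and bucket are placeholders, and the pipeline body is just a stand-in):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    machine_type="n1-highcpu-96",  # same effect as --machine_type on the CLI
)

with beam.Pipeline(options=options) as p:
    _ = p | beam.Create([1, 2, 3]) | beam.Map(print)  # stand-in for the real pipeline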
I'm developing a Django app which relies heavily on Celery task scheduling, using Redis as the backend. Tasks can be scheduled to run a long time in the future, as well as in a few seconds or minutes.
I've read about the Redis visibility timeout and the consequences of scheduling tasks with a timedelta greater than the visibility timeout (I'm also in the process of dealing with it in a previous project), so I'm interested in whether there's anything neater than my solution: have another "helper" task run 5 minutes before the "main" one needs to be executed; the helper schedules the "main" task to run at the required time and stores its task id in the DB; the "main" task then checks whether the stored task id matches the one currently being run. The last part (storing the task id) is required because multiple runs of the "helper" task could spawn a lot of "main" task instances, but with this approach each will have a different task id.
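Sketched out, the workaround looks roughly like this (the ScheduledJob model and its fields are hypothetical, just to show the shape of it):

from celery import shared_task

from myapp.models import ScheduledJob  # hypothetical: holds run_at and main_task_id

@shared_task
def helper(job_id):
    job = ScheduledJob.objects.get(pk=job_id)
    result = main.apply_async((job_id,), eta=job.run_at)
    # Remember which enqueue is the "real" one.
    ScheduledJob.objects.filter(pk=job_id).update(main_task_id=result.id)

@shared_task(bind=True)
def main(self, job_id):
    job = ScheduledJob.objects.get(pk=job_id)
    if job.main_task_id != self.request.id:
        return  # stale duplicate spawned by an earlier helper run / re-delivery
    # ... do the actual work ...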
I really dislike how that approach sounds and how it works: if the task is scheduled to run a month from the current time, the "helper" and "main" tasks get executed up to a hundred times.
I also know that this is an open issue, so I'm more interested in a neat workaround than in a proper solution.
Having tested the available options, in my opinion only using RabbitMQ as the broker solves the whole problem.
Although it's a viable option for me, the lack of some of the configuration parameters Redis has (e.g. pool size) makes it unusable for those who are using hosting services with a limit on open broker connections.
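A minimal sketch of what the switch looks like (host names and credentials are placeholders; Redis can stay on as the result backend):

from celery import Celery

app = Celery(
    "proj",
    broker="amqp://user:password@rabbitmq-host:5672//",  # RabbitMQ handles ETA/countdown without a visibility timeout
    backend="redis://redis-host:6379/0",                 # keep Redis for results if desired
)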
We have started to experience failures during activity registration when our processes start up. The problem is happening in GenericActivityWorker.registerActivityTypes.
The exception generated is:
Caused by: AmazonServiceException: Status Code: 400, AWS Service: AmazonSimpleWorkflow, AWS Request ID: 78726c24-47ee-11e3-8b49-534d57dc0b7f, AWS Error Code: ThrottlingException, AWS Error Message: Rate exceeded
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:686)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:350)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:202)
at com.amazonaws.services.simpleworkflow.AmazonSimpleWorkflowClient.invoke(AmazonSimpleWorkflowClient.java:3061)
at com.amazonaws.services.simpleworkflow.AmazonSimpleWorkflowClient.registerActivityType(AmazonSimpleWorkflowClient.java:2231)
at com.amazonaws.services.simpleworkflow.flow.worker.GenericActivityWorker.registerActivityType(GenericActivityWorker.java:153)
at com.amazonaws.services.simpleworkflow.flow.worker.GenericActivityWorker.registerActivityTypes(GenericActivityWorker.java:118)
at com.amazonaws.services.simpleworkflow.flow.worker.GenericActivityWorker.registerTypesToPoll(GenericActivityWorker.java:105)
at com.amazonaws.services.simpleworkflow.flow.worker.GenericWorker.start(GenericWorker.java:367)
at com.amazonaws.services.simpleworkflow.flow.ActivityWorker.start(ActivityWorker.java:248)
at com.fluid.retail.workflows.DefaultWorkflowHost.start(DefaultWorkflowHost.java:226)
... 5 more
The ActivityWorker in question has 5 activity implementation classes associated with it, and I think this throttling is occurring because the internal Flow Framework code loops over the activity types to register them without any delay between calls.
Because this code is internal to the framework, we can't add any sleep() calls to prevent being throttled.
Any ideas would be appreciated.
Are you sure this is happening while registering your activities, or is it happening while scheduling them?
You would get this issue if you try to run a workflow that schedules too many activities too quickly. At this point you have two options:
1. Try to make the activities sequential, so each one waits on the previous one.
2. Contact AWS to increase your account's rate limit.
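If the throttle really is hit during registration at startup (not scheduling), one workaround not covered above is to pre-register the types out of band with a paced, backed-off loop, combined with disabling the framework's automatic registration on start if your Flow Framework version supports that. A rough sketch using boto3 rather than the Flow Framework (the domain and type names are placeholders):

import time

import boto3
from botocore.exceptions import ClientError

swf = boto3.client("swf")

ACTIVITY_TYPES = [("MyActivities.doSomething", "1.0"), ("MyActivities.doSomethingElse", "1.0")]

def register_with_backoff(domain, types, base_delay=1.0):
    for name, version in types:
        delay = base_delay
        while True:
            try:
                swf.register_activity_type(domain=domain, name=name, version=version)
                break
            except swf.exceptions.TypeAlreadyExistsFault:
                break  # already registered on a previous run
            except ClientError as err:
                if err.response["Error"]["Code"] != "ThrottlingException":
                    raise
                time.sleep(delay)       # back off and retry this registration
                delay = min(delay * 2, 30)
        time.sleep(base_delay)          # pace successive registrations

# register_with_backoff("MyDomain", ACTIVITY_TYPES)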