Dataflow process hanging - google-cloud-platform

I am running a batch job on dataflow, querying from BigQuery. When I use the DirectRunner, everything works, and the results are written to a new BigQuery table. Things seem to break when I change to DataflowRunner.
The logs show that 30 worker instances are spun up successfully. The graph diagram in the web UI shows the job has started. The first 3 steps show "Running", the rest show "not started". None of the steps show any records transformed (i.e. the output collections all show '-'). The logs show many messages that look like this, which may be the issue:
skipping: failed to "StartContainer" for "python" with CrashLoopBackOff: "Back-off 10s restarting failed container=python pod=......
I took a step back and just ran the minimal wordcount example, and that completed successfully. So all the necessary APIs seem to be enabled for Dataflow runner. I'm just trying to get a sense of what is causing my Dataflow job to hang.
I am executing the job like this:
python2.7 script.py --runner DataflowRunner --project projectname --requirements_file requirements.txt --staging_location gs://my-store/staging --temp_location gs://my-store/temp

I'm not sure if my solution was the cause of the error pasted above, but fixing dependency problems (which were not showing up as errors in the log at all!) did solve the hanging Dataflow process.
So if you have a hanging process, make sure your workers have all their necessary dependencies. You can provide them through the --requirements_file argument, or through a custom setup.py script.
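If you go the setup.py route, a minimal sketch might look like the following (the package name and dependency list are placeholders for whatever your pipeline actually imports on the workers); it would be passed to the job with the --setup_file argument:

# Hypothetical setup.py for staging worker dependencies; names below are placeholders.
import setuptools

setuptools.setup(
    name="my-dataflow-job",          # placeholder package name
    version="0.0.1",
    packages=setuptools.find_packages(),
    install_requires=[
        # every package your pipeline imports on the workers
        "google-cloud-bigquery",
        "requests",
    ],
)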
Thanks to the help I received in this post, the pipeline appears to be operating, albeit VERY SLOWLY.

Related

Dataflow Job failing

I have a pipeline which requires a Dataflow job to run. I was using the gcloud CLI command to start the Dataflow job, which was working fine for over a month. But for the last three days the Dataflow job has been failing within 10-20 seconds with the following error log.
Failed to start the VM, launcher-2022012621245117717885921401920990, used for launching because of status code: UNAVAILABLE, reason: One or more operations had an error: 'operation-1643261093401-5d68989bed339-a33de830-9f90d92a': [UNAVAILABLE] 'HTTP_503'..
The command I'm using is:
gcloud dataflow sql query "SELECT tr.* FROM pubsub.topic.`my_project`.pubsub_topic as tr" \
  --job-name test_job \
  --region asia-south1 \
  --bigquery-write-disposition write-empty \
  --bigquery-project my_project \
  --bigquery-dataset test_dataset --bigquery-table table_name \
  --max-workers 1 --worker-machine-type n1-standard-1
I tried starting the job from the Cloud Console with the same parameters as well, and it failed with the same error log. I have tested job runs from the console before and they worked fine. The issue started a couple of days ago.
What could be going wrong?
Thanks.
The Google Cloud error model indicates that a 503 means the service is unavailable [1].
You may try changing the region, for example from europe-north1 to europe-west4; that should work. Additionally, you shouldn't include your job ID on Stack Overflow.
[1] https://cloud.google.com/apis/design/errors#handling_errors
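Since a 503/UNAVAILABLE is a transient error, the error-handling guide above also recommends retrying with exponential backoff. A minimal sketch of that idea, assuming launch_job is a hypothetical placeholder for however you start the Dataflow job (for example a wrapper around the gcloud command):

import time

def launch_with_backoff(launch_job, max_attempts=5, base_delay=2.0):
    # Retry a transient-failure-prone call with exponential backoff.
    for attempt in range(max_attempts):
        try:
            return launch_job()
        except Exception as exc:  # ideally catch only the 503/UNAVAILABLE error
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print("Attempt %d failed (%s); retrying in %.0fs" % (attempt + 1, exc, delay))
            time.sleep(delay)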

AWS EMR pyspark notebook fails with `Failed to run command /usr/bin/virtualenv (...)`

I have created a basic EMR cluster in AWS, and I'm trying to use the Jupyter Notebooks provided through the AWS Console. Launching the notebooks seems to work fine, and I'm also able to run basic python code in notebooks started with the pyspark kernel. Two variables are set up in the notebook: spark is a SparkSession instance, and sc is a SparkContext instance. Displaying sc yields <SparkContext master=yarn appName=livy-session-0> (the output can of course vary slightly depending on the session).
The problem arises once I perform operations that actually hit the spark machinery. For example:
sc.parallelize(list(range(10))).map(lambda x: x**2).collect()
I am no spark expert, but I believe this code should distribute the integers from 0 to 9 across the cluster, square them, and return the results in a list. Instead, I get a lengthy stack trace, mostly from the JVM, but also some python components. I believe the central part of the stack trace is the following:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 (TID 116, ip-XXXXXXXXXXXXX.eu-west-1.compute.internal, executor 17): java.lang.RuntimeException: Failed to run command: /usr/bin/virtualenv -p python3 --system-site-packages virtualenv_application_1586243436143_0002_0
The full stack trace is here.
A bit of digging in the AWS portal led me to log output from the nodes. stdout from one of the nodes includes the following:
The path python3 (from --python=python3) does not exist
I tried running the /usr/bin/virtualenv command on the master node manually (after logging in through SSH), and that worked fine, but the error is of course still present after I did that.
While this error occurs most of the time, I was able to get this working in one session, where I could run several operations against the spark cluster as I was expecting.
Technical information on the cluster setup:
emr-6.0.0
Applications installed are "Ganglia 3.7.2, Spark 2.4.4, Zeppelin 0.9.0, Livy 0.6.0, JupyterHub 1.0.0, Hive 3.1.2". Hadoop is also included.
3 nodes (one of them as master), all r5a.2xlarge.
Any ideas what I'm doing wrong? Note that I am completely new to EMR and Spark.
Edit: Added the stdout log and information about running the virtualenv command manually on the master node through ssh.
I have switched to using emr-5.29.0, which seems to resolve the problem. Perhaps this is an issue with emr-6.0.0? In any case, I have a functional workaround.
The issue for me was that the virtualenv was being made on the executors with a python path that didn't exist. Pointing the executors to the right one did the job for me:
"spark.pyspark.python": "/usr/bin/python3.7"
Here is how I reconfigured the Spark app at the beginning of the notebook:
{"conf":{"spark.pyspark.python": "/usr/bin/python3.7",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type": "native",
"spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv"}
}
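To confirm the executors actually picked up that interpreter, a quick sanity check can be run from the same notebook; this is just a sketch, assuming sc is the SparkContext the pyspark kernel already provides:

import sys

# The executor line should report the configured spark.pyspark.python path.
print("driver:", sys.executable)
print("executor:", sc.parallelize([0], 1).map(lambda _: __import__("sys").executable).first())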

How to execute mvn clean install in GoCD

The mvn clean install command does not execute in GoCD. The pipeline gets triggered, but nothing is displayed in the logs, and the job keeps running forever even after setting the inactivity time to 1 minute.
I have created a pipeline and added the mvn clean install command to it as in the image below. Please let me know what needs to be changed to generate artifacts as the first step.
The most important clue is in your first screenshot: it says "Agent: Not yet assigned". That means that no agent (aka worker) could be found that can handle your job.
Please read the manual on managing agents, specifically the section Matching jobs to agents.
Frequent reasons why no agent can be assigned:
No agents available at all
The agent(s) are in environments, but the pipeline isn't
A mismatch between the resources specified in the job and those configured in agent management.

Missing log lines when writing to cloudwatch from ECS Docker containers

(Docker container on AWS-ECS exits before all the logs are printed to CloudWatch Logs)
Why are some streams of a CloudWatch Logs group incomplete (i.e., the Fargate Docker container exits successfully but the logs stop being updated abruptly)? I'm seeing this intermittently, in almost all log groups, but not on every log stream/task run. I'm running on version 1.3.0.
Description:
A Dockerfile runs node.js or Python scripts using the CMD command.
These are not servers/long-running processes, and my use case requires the containers to exit when the task completes.
Sample Dockerfile:
FROM node:6
WORKDIR /path/to/app/
COPY package*.json ./
RUN npm install
COPY . .
CMD [ "node", "run-this-script.js" ]
All the logs are printed correctly to my terminal's stdout/stderr when this command is run on the terminal locally with docker run.
To run these as ECS tasks on Fargate, the log driver is set to awslogs in a CloudFormation template.
...
LogConfiguration:
  LogDriver: 'awslogs'
  Options:
    awslogs-group: !Sub '/ecs/ecs-tasks-${TaskName}'
    awslogs-region: !Ref AWS::Region
    awslogs-stream-prefix: ecs
...
Seeing that the CloudWatch Logs output is sometimes incomplete, I have run tests and checked every limit from CW Logs Limits and am certain the problem is not there.
I initially thought this was an issue with Node.js exiting asynchronously before console.log() is flushed, or that the process was exiting too soon, but the same problem occurs when I use a different language as well - which makes me believe this is not an issue with the code, but rather with CloudWatch specifically.
Inducing delays in the code by adding a sleep timer has not worked for me.
It's possible that since the Docker container exits immediately after the task is completed, the logs don't get enough time to be written over to CloudWatch Logs, but there must be a way to ensure that this doesn't happen?
sample logs:
incomplete stream:
{ "message": "configs to run", "data": {"dailyConfigs":"filename.json"]}}
running for filename
completed log stream:
{ "message": "configs to run", "data": {"dailyConfigs":"filename.json"]}}
running for filename
stdout: entered query_script
... <more log lines>
stderr:
real 0m23.394s
user 0m0.008s
sys 0m0.004s
(node:1) DeprecationWarning: PG.end is deprecated - please see the upgrade guide at https://node-postgres.com/guides/upgrading
UPDATE: This now appears to be fixed, so there is no need to implement the workaround described below
I've seen the same behaviour when using ECS Fargate containers to run Python scripts - and had the same resulting frustration!
I think it's due to CloudWatch Logs Agent publishing log events in batches:
How are log events batched?
A batch becomes full and is published when any of the following conditions are met:
The buffer_duration amount of time has passed since the first log event was added.
Less than batch_size of log events have been accumulated but adding the new log event exceeds the batch_size.
The number of log events has reached batch_count.
Log events from the batch don't span more than 24 hours, but adding the new log event exceeds the 24 hours constraint.
(Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AgentReference.html)
So a possible explanation is that log events are buffered by the agent but not yet published when the ECS task is stopped. (And if so, that seems like an ECS issue - any AWS ECS engineers willing to give their perspective on this...?)
There doesn't seem to be a direct way to ensure the logs are published, but it does suggest one could wait at least buffer_duration seconds (by default, 5 seconds), and any prior logs should be published.
With a bit of testing that I'll describe below, here's a workaround I landed on. A shell script run_then_wait.sh wraps the command to trigger the Python script, to add a sleep after the script completes.
Dockerfile
FROM python:3.7-alpine
ADD run_then_wait.sh .
ADD main.py .
# The original command
# ENTRYPOINT ["python", "main.py"]
# To run the original command and then wait
ENTRYPOINT ["sh", "run_then_wait.sh", "python", "main.py"]
run_then_wait.sh
#!/bin/sh
set -e
# Wait 10 seconds on exit: twice the `buffer_duration` default of 5 seconds
trap 'echo "Waiting for logs to flush to CloudWatch Logs..."; sleep 10' EXIT
# Run the given command
"$#"
main.py
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()

if __name__ == "__main__":
    # After testing some random values, had most luck to induce the
    # issue by sleeping 9 seconds here; would occur ~30% of the time
    time.sleep(9)
    logger.info("Hello world")
Hopefully the approach can be adapted to your situation. You could also implement the sleep inside your script, but it can be trickier to ensure it happens regardless of how it terminates.
It's hard to prove that the proposed explanation is accurate, so I used the above code to test whether the workaround was effective. The test was the original command vs. with run_then_wait.sh, 30 runs each. The results were that the issue was observed 30% of the time, vs 0% of the time, respectively. Hope this is similarly effective for you!
Just contacted AWS support about this issue and here is their response:
...
Based on that case, I can see that this occurs for containers in a
Fargate Task that exit quickly after outputting to stdout/stderr. It
seems to be related to how the awslogs driver works, and how Docker in
Fargate communicates to the CW endpoint.
Looking at our internal tickets for the same, I can see that our
service team are still working to get a permanent resolution for this
reported bug. Unfortunately, there is no ETA shared for when the fix
will be deployed. However, I've taken this opportunity to add this
case to the internal ticket to inform the team of the similar and try
to expedite the process
In the meantime, this can be avoided by extending the lifetime of the
exiting container by adding a delay (~>10 seconds) between the logging
output of the application and the exit of the process (exit of the
container).
...
Update:
Contacted AWS around August 1st, 2019, they say this issue has been fixed.
I observed this as well. It must be an ECS bug?
My workaround (Python 3.7):
import atexit
import logging
from time import sleep

logger = logging.getLogger()

def finalizer():
    logger.info("All tasks have finished. Exiting.")
    # Workaround:
    # Fargate will exit and the final batch of CloudWatch logs will be lost
    sleep(10)

# Register after the function is defined so it runs at interpreter exit
atexit.register(finalizer)
I had the same problem with flushing logs to CloudWatch.
Following asavoy's answer, I switched from the exec form to the shell form of the ENTRYPOINT and added a 10-second sleep at the end.
Before:
ENTRYPOINT ["java","-jar","/app.jar"]
After:
ENTRYPOINT java -jar /app.jar; sleep 10

AWS Elastic Beanstalk - Starting SWF Background Workers

I have been trying to find out the best way to run background jobs using PHP on AWS Elastic Beanstalk, and after many hours of searching on Google and SO, I believe that one good solution is using SWF and activity workers.
I found this example buried in the aws-sdk-for-php: https://github.com/amazonwebservices/aws-sdk-for-php/tree/master/_samples/AmazonSimpleWorkflow/cron
The read-me file says:
To run this sample, you need to execute three scripts from the command line in separate terminal/console windows
and
Note that the start_cron_example_workflow.php script will exit quickly
while the decider and activity worker scripts keep running until you
manually terminate them.
The decider and activity worker will loop "forever", and running them in EB is what I'm having trouble doing.
In my .ebextensions directory I have a file that executes these files:
container_commands:
  01background_task:
    command: "php -f start_cron_example_activity_workers.php"
  02background_task:
    command: "php -f start_cron_example_workflow_workers.php"
But I get the following error messages:
ERROR
Failed to deploy application version.
ERROR
Some instances have not responded to commands. Responses were not received from [i-a5417ed4].
Any way I can do this using config files? How can I make this work in AWS EB without introducing a single point of failure?
Thank you.
You might consider using a service like IronWorker — this is specifically designed for what you are trying to do and will probably work better than putting together your own solution on a micro instance.
I have not used Iron.io yet, but was evaluating it as I am looking to move my stuff over to AWS so I need to have cron jobs handled as well.
Have you taken a look at the Fat Controller? It can daemonise anything. There's documentation and examples on the website: http://fat-controller.sourceforge.net