Couldn't start the dag in apache airflow MWAA - amazon-web-services

I am experiencing some weird error while starting my sample dag, it executed fine when i created it for the first time,but now executions seems to be failing.
Below is the error i am getting:-
Dag is just simple bash commands, which first print date followed by some more bash tasks.But its seems first task is getting failed before executions starts.
And Since its hasnt executed, it seems to not producing any logs as well.

Related

Kill APscheduler add_job based on id

We have a flask script get_logs.py that uses APScheduler and contains following job
scheduler.add_job(id="create_recommendation_entries", trigger = 'interval',seconds=60*10,func=create_entries)
Someone ran the script and now the the logs show that this script is still running at 10 minutes interval even after terminating.
The process id is not listed nor does it show using grep and we don't know whether it was executed using nohup or gunicorn.
How do I kill this job based on id="create_recommendation_entries"because I don't know any of its stats(port,pid etc).
Rerunning the script creates a different thread and stops after Ctrl+C but the previous one remains still in process

How to start a Snakemake workflow on AWS and detach?

I am trying to execute a Snakemake workflow on AWS, and have succeeded in executing my workflow using the command:
snakemake --tibanna --use-conda --default-remote-prefix=mybucket/myproject
and it works successfully. So far, so good. Unfortunately snakemake keeps running in the foreground in the terminal until the workflow ends. Using Ctrl-C on it ends the run. This is problematic for me when I want to run a pipeline that takes a few days.
Is there a way to run pipelines using snakemake --tibanna and detach and poll the results later?
I believe tibanna has the capability: tibanna run_workflow runs the workflow and detatches, and you can check the status later using tibanna stat. I just can't get snakemake to finish leaving the processes scheduled in the cloud.

After clearing logs Dag is in running state but task not getting scheduled/executed

I have airflow (v1.10.4) running in Kubernetes cluster using Celery Executor. We are using RDS Postgres as an external metadata db.
Problem Statement: All dags are getting scheduled and run on time but the problem happens when we clear any log (old run) to re-run the pipeline Airflow puts this Dag in running state but none of the tasks gets executed. We also saw when we re-mark those cleared logs as success Airflow schedules the next run but none of the tasks get executed (none of the tasks goes into queue or running state).
The dag in question has depends_on_past=True and catchup=True.
Can anyone know what's going wrong here?

AWS EMR pyspark notebook fails with `Failed to run command /usr/bin/virtualenv (...)`

I have created a basic EMR cluster in AWS, and I'm trying to use the Jupyter Notebooks provided through the AWS Console. Launching the notebooks seems to work fine, and I'm also able to run basic python code in notebooks started with the pyspark kernel. Two variables are set up in the notebook: spark is a SparkSession instance, and sc is a SparkContext instance. Displaying sc yields <SparkContext master=yarn appName=livy-session-0> (the output can of course vary slightly depending on the session).
The problem arises once I perform operations that actually hit the spark machinery. For example:
sc.parallelize(list(range(10))).map(lambda x: x**2).collect()
I am no spark expert, but I believe this code should distribute the integers from 0 to 9 across the cluster, square them, and return the results in a list. Instead, I get a lengthy stack trace, mostly from the JVM, but also some python components. I believe the central part of the stack trace is the following:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 (TID 116, ip-XXXXXXXXXXXXX.eu-west-1.compute.internal, executor 17): java.lang.RuntimeException: Failed to run command: /usr/bin/virtualenv -p python3 --system-site-packages virtualenv_application_1586243436143_0002_0
The full stack trace is here.
A bit of digging in the AWS portal led me to log output from the nodes. stdout from one of the nodes includes the following:
The path python3 (from --python=python3) does not exist
I tried running the /usr/bin/virtualenv command on the master node manually (after logging in through), and that worked fine, but the error is of course still present after I did that.
While this error occurs most of the time, I was able to get this working in one session, where I could run several operations against the spark cluster as I was expecting.
Technical information on the cluster setup:
emr-6.0.0
Applications installed are "Ganglia 3.7.2, Spark 2.4.4, Zeppelin 0.9.0, Livy 0.6.0, JupyterHub 1.0.0, Hive 3.1.2". Hadoop is also included.
3 nodes (one of them as master), all r5a.2xlarge.
Any ideas what I'm doing wrong? Note that I am completely new to EMR and Spark.
Edit: Added the stdout log and information about running the virtualenv command manually on the master node through ssh.
I have switched to using emr-5.29.0, which seems to resolve the problem. Perhaps this is an issue with emr-6.0.0? In any case, I have a functional workaround.
The issue for me was that the virtualenv was being made on the executors with a python path that didn't exist. Pointing the executors to the right one did the job for me:
"spark.pyspark.python": "/usr/bin/python3.7"
Here is how I reconfiged the spark app at the beginning of the notebook:
{"conf":{"spark.pyspark.python": "/usr/bin/python3.7",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type": "native",
"spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv"}
}

Dataflow process hanging

I am running a batch job on dataflow, querying from BigQuery. When I use the DirectRunner, everything works, and the results are written to a new BigQuery table. Things seem to break when I change to DataflowRunner.
The logs show that 30 worker instances are spun up successfully. The graph diagram in the web UI shows the job has started. The first 3 steps show "Running", the rest show "not started". None of the steps show any records transformed (i.e. outputcollections all show '-'). The logs show many messages that look like this, which may be the issue:
skipping: failed to "StartContainer" for "python" with CrashLoopBackOff: "Back-off 10s restarting failed container=python pod=......
I took a step back and just ran the minimal wordcount example, and that completed successfully. So all the necessary APIs seem to be enabled for Dataflow runner. I'm just trying to get a sense of what is causing my Dataflow job to hang.
I am executing the job like this:
python2.7 script.py --runner DataflowRunner --project projectname --requirements_file requirements.txt --staging_location gs://my-store/staging --temp_location gs://my-store/temp
I'm not sure if my solution was the cause of the error pasted above, but fixing dependencies problems (which were not showing up as errors in the log at all!) did solve the hanging dataflow processes.
So if you have a hanging process, make sure your workers have all their necessary dependencies. You can provide them through the --requirements_file argument, or through a custom setup.py script.
Thanks to the help I received in this post, the pipeline appears to be operating, albeit VERY SLOWLY.