Airflow cannot run DAG because upstream tasks have failed - python-2.7

I am trying to use Apache Airflow to create a workflow. I installed Airflow manually in my own Anaconda kernel on a server.
Here is how I set up the environment to run a simple DAG:
export AIRFLOW_HOME=~/airflow/airflow_home # my airflow home
export AIRFLOW=~/.conda/.../lib/python2.7/site-packages/airflow/bin
export PATH=~/.conda/.../bin:$AIRFLOW:$PATH # my kernel
When I do the same thing using airflow test, it works for a particular task independently. For example, in dag1 (task1 >> task2):
airflow test dag1 task2 2017-06-22
I expected it to run task1 first and then task2, but it runs only task2, independently.
Do you have any idea why? Thank you very much in advance!
Here is my code:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
default_args = {
    'owner': 'txuantu',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'email': ['tran.xuantu@axa.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}
dag = DAG(
    'tutorial', default_args=default_args, schedule_interval=timedelta(1))
def python_op1(ds, **kwargs):
    print(ds)
    return 0
def python_op2(ds, **kwargs):
    print(str(kwargs))
    return 0
# t1, t2 and t3 are examples of tasks created by instantiating operators
# t1 = BashOperator(
#     task_id='bash_operator',
#     bash_command='echo {{ ds }}',
#     dag=dag)
t1 = PythonOperator(
    task_id='python_operator1',
    python_callable=python_op1,
    # provide_context=True,
    dag=dag)
t2 = PythonOperator(
    task_id='python_operator2',
    python_callable=python_op2,
    # provide_context=True,
    dag=dag)
t2.set_upstream(t1)
Airflow: v1.8.0
I am using the SequentialExecutor with SQLite. This is the command I run:
airflow run tutorial python_operator2 2015-06-01
Here is the error message:
[2017-06-28 22:49:15,336] {models.py:167} INFO - Filling up the DagBag from /home/txuantu/airflow/airflow_home/dags
[2017-06-28 22:49:16,069] {base_executor.py:50} INFO - Adding to queue: airflow run tutorial python_operator2 2015-06-01T00:00:00 --mark_success --local -sd DAGS_FOLDER/tutorial.py
[2017-06-28 22:49:16,072] {sequential_executor.py:40} INFO - Executing command: airflow run tutorial python_operator2 2015-06-01T00:00:00 --mark_success --local -sd DAGS_FOLDER/tutorial.py
[2017-06-28 22:49:16,765] {models.py:167} INFO - Filling up the DagBag from /home/txuantu/airflow/airflow_home/dags/tutorial.py
[2017-06-28 22:49:16,986] {base_task_runner.py:112} INFO - Running: ['bash', '-c', u'airflow run tutorial python_operator2 2015-06-01T00:00:00 --mark_success --job_id 1 --raw -sd DAGS_FOLDER/tutorial.py']
[2017-06-28 22:49:17,373] {base_task_runner.py:95} INFO - Subtask: [2017-06-28 22:49:17,373] {__init__.py:57} INFO - Using executor SequentialExecutor
[2017-06-28 22:49:17,694] {base_task_runner.py:95} INFO - Subtask: [2017-06-28 22:49:17,693] {models.py:167} INFO - Filling up the DagBag from /home/txuantu/airflow/airflow_home/dags/tutorial.py
[2017-06-28 22:49:17,899] {base_task_runner.py:95} INFO - Subtask: [2017-06-28 22:49:17,899] {models.py:1120} INFO - Dependencies not met for <TaskInstance: tutorial.python_operator2 2015-06-01 00:00:00 [None]>, dependency 'Trigger Rule' FAILED: Task's trigger rule 'all_success' requires all upstream tasks to have succeeded, but found 1 non-success(es). upstream_tasks_state={'successes': 0, 'failed': 0, 'upstream_failed': 0, 'skipped': 0, 'done': 0}, upstream_task_ids=['python_operator1']
[2017-06-28 22:49:22,011] {jobs.py:2083} INFO - Task exited with return code 0

If you only want to run python_operator2 without waiting for its upstream task, you should execute:
airflow run tutorial python_operator2 2015-06-01 --ignore_dependencies
If you want to execute the entire DAG, including both tasks, use trigger_dag:
airflow trigger_dag tutorial
For reference, airflow test will "run a task without checking for dependencies."
Documentation for all three commands can be found at https://airflow.incubator.apache.org/cli.html
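The log line about the 'all_success' trigger rule can be read as a simple count: the task has one upstream task and zero recorded successes. The following is only a simplified model of that check, not Airflow's actual implementation; the function name all_success_met and the state dictionary are made up for illustration.

```python
def all_success_met(upstream_task_ids, upstream_states):
    """Simplified model of the 'all_success' trigger rule.

    upstream_states maps task_id -> state string; a task that has never
    run has no entry. Illustration only, not Airflow's real code.
    """
    successes = sum(
        1 for task_id in upstream_task_ids
        if upstream_states.get(task_id) == "success"
    )
    return successes == len(upstream_task_ids)


# Situation in the log above: python_operator1 has never run for this date.
before = all_success_met(["python_operator1"], {})
# Situation once python_operator1 has succeeded for this date.
after = all_success_met(["python_operator1"], {"python_operator1": "success"})
print(before, after)
```

Under this model, running python_operator2 alone is rejected until python_operator1 has a success recorded for the same execution date, which matches the "found 1 non-success(es)" message.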

Finally, I found an answer to my problem. I had assumed Airflow loads tasks lazily, but it seems it does not. So instead of:
t2.set_upstream(t1)
It should be:
t1.set_downstream(t2)

Related

GCP Composer 2 (Airflow 2) Data proc operators - pass package to PYSPARK_JOB

I'm using GCP Composer 2 to schedule PySpark (Structured Streaming) jobs.
The PySpark code reads from and writes to Kafka.
The DAG uses three operators: DataprocCreateClusterOperator (creates a GKE cluster), DataprocSubmitJobOperator (runs the PySpark job) and DataprocDeleteClusterOperator (deletes the Dataproc cluster).
In the code below, I'm passing the jars and the files (certs/config files) required to run the PySpark code that reads from and writes to Kafka:
PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {
        "main_python_file_uri": PYSPARK_URI,
        "jar_file_uris": [
            "gs://dataproc-spark-jars/mongo-spark-connector_2.12-3.0.2.jar",
            "gs://dataproc-spark-jars/bson-4.0.5.jar",
            "gs://dataproc-spark-jars/mongo-spark-connector_2.12-3.0.2.jar",
            "gs://dataproc-spark-jars/mongodb-driver-core-4.0.5.jar",
            "gs://dataproc-spark-jars/mongodb-driver-sync-4.0.5.jar",
            "gs://dataproc-spark-jars/spark-avro_2.12-3.1.2.jar",
            "gs://dataproc-spark-jars/spark-bigquery-with-dependencies_2.12-0.23.2.jar",
            "gs://dataproc-spark-jars/spark-token-provider-kafka-0-10_2.12-3.2.0.jar",
            "gs://dataproc-spark-jars/htrace-core4-4.1.0-incubating.jar",
            "gs://dataproc-spark-jars/hadoop-client-3.3.1.jar",
            "gs://dataproc-spark-jars/spark-sql-kafka-0-10_2.12-3.2.0.jar",
            "gs://dataproc-spark-jars/hadoop-client-runtime-3.3.1.jar",
            "gs://dataproc-spark-jars/hadoop-client-3.3.1.jar",
            "gs://dataproc-spark-configs/kafka-clients-3.2.0.jar",
        ],
        "file_uris": [
            "gs://kafka-certs/versa-kafka-gke-ca.p12",
            "gs://kafka-certs/syslog-vani.p12",
            "gs://kafka-certs/alarm-compression-user.p12",
            "gs://kafka-certs/appstats-user.p12",
            "gs://kafka-certs/insights-user.p12",
            "gs://kafka-certs/intfutil-user.p12",
            "gs://kafka-certs/reloadpred-chkpoint-user.p12",
            "gs://kafka-certs/reloadpred-user.p12",
            "gs://dataproc-spark-configs/topic-customer-map.cfg",
            "gs://dataproc-spark-configs/params.cfg",
            "gs://kafka-certs/issues-user.p12",
            "gs://kafka-certs/anomaly-user.p12",
        ],
    }
}
path = "gs://dataproc-spark-configs/pip_install.sh"
CLUSTER_GENERATOR_CONFIG = ClusterGenerator(
    project_id=PROJECT_ID,
    zone="us-east1-b",
    master_machine_type="n1-standard-4",
    worker_machine_type="n1-standard-4",
    num_workers=4,
    storage_bucket="dataproc-spark-logs",
    init_actions_uris=[path],
    metadata={'PIP_PACKAGES': 'pyyaml requests pandas openpyxl kafka-python'},
).make()
with models.DAG(
        'UsingComposer2',
        # Continue to run DAG twice per day
        default_args=default_dag_args,
        schedule_interval='0 0/12 * * *',
        catchup=False,
) as dag:
    create_dataproc_cluster = DataprocCreateClusterOperator(
        task_id="create_dataproc_cluster",
        cluster_name="composer2",
        region=REGION,
        cluster_config=CLUSTER_GENERATOR_CONFIG
    )
    run_dataproc_spark = DataprocSubmitJobOperator(
        task_id="run_dataproc_spark",
        job=PYSPARK_JOB,
        location=REGION,
        project_id=PROJECT_ID,
    )
    delete_dataproc_cluster = DataprocDeleteClusterOperator(
        task_id="delete_dataproc_cluster",
        project_id=PROJECT_ID,
        cluster_name=CLUSTER_NAME,
        region=REGION
    )
    create_dataproc_cluster >> run_dataproc_spark >> delete_dataproc_cluster
My question: how do I pass a package, instead of the individual jars, for spark-kafka?
With spark-submit I can pass a package; how do I do the same with Composer/Airflow?
Here is a sample spark-submit command, where I pass the spark-sql-kafka and mongo-spark-connector packages:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0,org.mongodb.spark:mongo-spark-connector_2.12:3.0.2 /Users/karanalang/PycharmProjects/Kafka/StructuredStreaming-KafkaConsumer-insignts.py
Thanks in advance!
Update:
Based on @Anjela B's suggestion, I tried the following, but it does not work.
Changes to PYSPARK_JOB to pass the packages:
PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {
        "main_python_file_uri": PYSPARK_URI,
        "properties": {  # you can use this field to pass other properties
            "org.apache.spark": "spark-sql-kafka-0-10_2.12:3.1.3",
            "org.mongodb.spark": "mongo-spark-connector_2.12:3.0.2"
        },
        "file_uris": [
            "gs://kafka-certs/versa-kafka-gke-ca.p12",
            "gs://kafka-certs/syslog-vani.p12",
            "gs://kafka-certs/alarm-compression-user.p12",
            "gs://kafka-certs/appstats-user.p12",
            "gs://kafka-certs/insights-user.p12",
            "gs://kafka-certs/intfutil-user.p12",
            "gs://kafka-certs/reloadpred-chkpoint-user.p12",
            "gs://kafka-certs/reloadpred-user.p12",
            "gs://dataproc-spark-configs/topic-customer-map.cfg",
            "gs://dataproc-spark-configs/params.cfg",
            "gs://kafka-certs/issues-user.p12",
            "gs://kafka-certs/anomaly-user.p12",
        ],
    }
}
Error:
22/06/17 22:57:28 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1655505629376_0004
22/06/17 22:57:29 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at versa-insights2-m/10.142.0.70:8030
22/06/17 22:57:30 INFO com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl: Ignoring exception of type GoogleJsonResponseException; verified object already exists with desired state.
Traceback (most recent call last):
File "/tmp/8991c714-7036-45ff-b61b-ece54cfffc51/alarm_insights.py", line 442, in <module>
sys.exit(main())
File "/tmp/8991c714-7036-45ff-b61b-ece54cfffc51/alarm_insights.py", line 433, in main
main_proc = insightGen()
File "/tmp/8991c714-7036-45ff-b61b-ece54cfffc51/alarm_insights.py", line 99, in __init__
self.all_DF = self.spark.read \
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 210, in load
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o63.load.
: java.lang.ClassNotFoundException: Failed to find data source: mongo. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:692)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:746)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:265)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:225)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: mongo.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:666)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:666)
at scala.util.Failure.orElse(Try.scala:224)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:666)
... 14 more
You may use the following code to pass the configuration:
import datetime
from airflow import models
from airflow.operators import bash
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
# If you are running Airflow in more than one time zone
# see https://airflow.apache.org/docs/apache-airflow/stable/timezone.html
# for best practices
YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)
PYSPARK_JOB = {
    "pyspark_job": {
        "main_python_file_uri": "gs://<bucket>/20220606.py",  # this field is for .py packages
        "properties": {  # you can use this field to pass other properties
            "org.apache.spark": "spark-sql-kafka-0-10_2.12:3.2.0",
            "org.mongodb.spark": "mongo-spark-connector_2.12:3.0.2"
        },
        "python_file_uris": ["gs://<bucket>/20220606.py"]
    },
    "reference": {
        "project_id": "<project_id>"
    },
    "placement": {
        "cluster_name": "<cluster_name>"
    }
}
REGION = "us-central1"
PROJECT_ID = "<project_id>"
default_args = {
    'owner': 'Composer Example',
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'start_date': YESTERDAY,
}
with models.DAG(
        'composer_quickstart',
        catchup=False,
        default_args=default_args,
        schedule_interval=datetime.timedelta(days=1)) as dag:
    # Print the dag_run id from the Airflow logs
    print_dag_run_conf = bash.BashOperator(
        task_id='print_dag_run_conf', bash_command='echo {{ dag_run.id }}')
    run_dataproc_spark = DataprocSubmitJobOperator(
        task_id="run_dataproc_spark",
        job=PYSPARK_JOB,
        location=REGION,
        project_id=PROJECT_ID,
    )
    print_dag_run_conf >> run_dataproc_spark
I followed this PySpark Job Documentation to know which field to use to pass required packages.
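A note on the properties field: its keys are Spark configuration property names, and the Spark property that corresponds to spark-submit --packages is spark.jars.packages, which takes a comma-separated list of Maven coordinates. The sketch below builds the job dictionary that way. It is an untested assumption that this resolves the "Failed to find data source: mongo" error from the update, and it also assumes the cluster can reach a Maven repository to download the packages. The helper name build_pyspark_job is hypothetical.

```python
def build_pyspark_job(project_id, cluster_name, main_uri, packages, file_uris=()):
    """Build a Dataproc PySpark job dict that asks Spark to resolve Maven packages."""
    return {
        "reference": {"project_id": project_id},
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {
            "main_python_file_uri": main_uri,
            # Assumption: these entries are forwarded to Spark as configuration,
            # so spark.jars.packages plays the role of spark-submit --packages.
            "properties": {"spark.jars.packages": ",".join(packages)},
            "file_uris": list(file_uris),
        },
    }


# Placeholders in angle brackets are kept from the answer above.
job = build_pyspark_job(
    project_id="<project_id>",
    cluster_name="<cluster_name>",
    main_uri="gs://<bucket>/20220606.py",
    packages=[
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0",
        "org.mongodb.spark:mongo-spark-connector_2.12:3.0.2",
    ],
)
print(job["pyspark_job"]["properties"])
```

The resulting dictionary can be passed as job=... to DataprocSubmitJobOperator in the same way as PYSPARK_JOB.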
Airflow DAG logs:
*** Reading remote log from gs://us-central1-case-20220331-fde8f6be-bucket/logs/composer_quickstart/run_dataproc_spark/2022-06-06T06:53:24.637504+00:00/1.log.
[2022-06-06, 06:53:39 UTC] {taskinstance.py:1033} INFO - Dependencies all met for <TaskInstance: composer_quickstart.run_dataproc_spark manual__2022-06-06T06:53:24.637504+00:00 [queued]>
[2022-06-06, 06:53:39 UTC] {taskinstance.py:1033} INFO - Dependencies all met for <TaskInstance: composer_quickstart.run_dataproc_spark manual__2022-06-06T06:53:24.637504+00:00 [queued]>
[2022-06-06, 06:53:39 UTC] {taskinstance.py:1239} INFO -
--------------------------------------------------------------------------------
[2022-06-06, 06:53:39 UTC] {taskinstance.py:1240} INFO - Starting attempt 1 of 2
[2022-06-06, 06:53:39 UTC] {taskinstance.py:1241} INFO -
--------------------------------------------------------------------------------
[2022-06-06, 06:53:39 UTC] {taskinstance.py:1260} INFO - Executing <Task(DataprocSubmitJobOperator): run_dataproc_spark> on 2022-06-06 06:53:24.637504+00:00
[2022-06-06, 06:53:39 UTC] {standard_task_runner.py:52} INFO - Started process 65510 to run task
[2022-06-06, 06:53:39 UTC] {standard_task_runner.py:76} INFO - Running: ['airflow', 'tasks', 'run', 'composer_quickstart', 'run_dataproc_spark', 'manual__2022-06-06T06:53:24.637504+00:00', '--job-id', '21439', '--raw', '--subdir', 'DAGS_FOLDER/20220606_1.py', '--cfg-path', '/tmp/tmp7p1eyqqm', '--error-file', '/tmp/tmpdr2m4rwe']
[2022-06-06, 06:53:39 UTC] {standard_task_runner.py:77} INFO - Job 21439: Subtask run_dataproc_spark
[2022-06-06, 06:53:41 UTC] {logging_mixin.py:109} INFO - Running <TaskInstance: composer_quickstart.run_dataproc_spark manual__2022-06-06T06:53:24.637504+00:00 [running]> on host airflow-worker-7b5f8fc749-pd8f9
[2022-06-06, 06:53:44 UTC] {taskinstance.py:1426} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_EMAIL=
AIRFLOW_CTX_DAG_OWNER=Composer Example
AIRFLOW_CTX_DAG_ID=composer_quickstart
AIRFLOW_CTX_TASK_ID=run_dataproc_spark
AIRFLOW_CTX_EXECUTION_DATE=2022-06-06T06:53:24.637504+00:00
AIRFLOW_CTX_DAG_RUN_ID=manual__2022-06-06T06:53:24.637504+00:00
[2022-06-06, 06:53:44 UTC] {dataproc.py:1878} INFO - Submitting job
[2022-06-06, 06:53:44 UTC] {credentials_provider.py:312} INFO - Getting connection using `google.auth.default()` since no key file is defined for hook.
[2022-06-06, 06:53:45 UTC] {dataproc.py:1890} INFO - Job e7e800e7-fbfd-45e0-8021-eca4e2a7a377 submitted successfully.
[2022-06-06, 06:53:45 UTC] {dataproc.py:1903} INFO - Waiting for job e7e800e7-fbfd-45e0-8021-eca4e2a7a377 to complete
[2022-06-06, 06:54:16 UTC] {dataproc.py:1907} INFO - Job e7e800e7-fbfd-45e0-8021-eca4e2a7a377 completed successfully.
[2022-06-06, 06:54:16 UTC] {taskinstance.py:1268} INFO - Marking task as SUCCESS. dag_id=composer_quickstart, task_id=run_dataproc_spark, execution_date=20220606T065324, start_date=20220606T065339, end_date=20220606T065416
[2022-06-06, 06:54:16 UTC] {local_task_job.py:154} INFO - Task exited with return code 0
[2022-06-06, 06:54:16 UTC] {local_task_job.py:264} INFO - 0 downstream tasks scheduled from follow-on schedule check

Some tasks are not processed by Django-Q

I have a Django-Q cluster running with this configuration:
Q_CLUSTER = {
    'name': 'pretty_name',
    'workers': 1,
    'recycle': 500,
    'timeout': 500,
    'queue_limit': 5,
    'cpu_affinity': 1,
    'label': 'Django Q',
    'save_limit': 0,
    'ack_failures': True,
    'max_attempts': 1,
    'attempt_count': 1,
    'redis': {
        'host': CHANNEL_REDIS_HOST,
        'port': CHANNEL_REDIS_PORT,
        'db': 5,
    }
}
On this cluster I have a scheduled task that is supposed to run every 15 minutes.
Sometimes it works fine, and this is what I see in my worker logs:
[Q] INFO Enqueued 1
[Q] INFO Process-1 created a task from schedule [2]
[Q] INFO Process-1:1 processing [oranges-georgia-snake-social]
[ My Personal Custom Task Log]
[Q] INFO Processed [oranges-georgia-snake-social]
But other times the task does not start, and this is all I get in my log:
[Q] INFO Enqueued 1
[Q] INFO Process-1 created a task from schedule [2]
And then nothing for the next 15 minutes.
Any idea where this might come from?
This was my prod environment, and it turned out that my dev environment was using the same Redis db. Even though no task existed in my dev environment, this appears to have been the cause of the issue.
The solution was to use a different Redis db for the dev and prod environments!
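One way to keep the two environments apart is to derive the Redis db index from the environment name when building the settings. This is only a sketch: the environment names, the db index 6 for dev, and the helper build_q_cluster are assumptions for illustration, and only a few of the original settings are repeated.

```python
# Hypothetical mapping of environment name to Redis db index.
# 5 is the db from the configuration above; 6 is an arbitrary choice for dev.
REDIS_DB_BY_ENV = {"prod": 5, "dev": 6}


def build_q_cluster(env, host="localhost", port=6379):
    """Return a Q_CLUSTER-style dict whose Redis db depends on the environment."""
    return {
        "name": "pretty_name",
        "workers": 1,
        "timeout": 500,
        "redis": {"host": host, "port": port, "db": REDIS_DB_BY_ENV[env]},
    }


prod_cluster = build_q_cluster("prod")
dev_cluster = build_q_cluster("dev")
print(prod_cluster["redis"]["db"], dev_cluster["redis"]["db"])
```

With separate db indexes, a worker started in dev no longer reads from the queue that the prod scheduler writes to.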

bash: spark-submit: command not found while executing a DAG in AWS Managed Apache Airflow

I have to run a Spark job (I am new to Spark) and I am getting the following error:
[2022-02-16 14:47:45,415] {{bash.py:135}} INFO - Tmp dir root location: /tmp
[2022-02-16 14:47:45,416] {{bash.py:158}} INFO - Running command: spark-submit --class org.xyz.practice.driver.PractitionerDriver s3://pfdt-poc-temp/xyz_test/org.xyz.spark-xy_mvp-1.0.0-SNAPSHOT.jar
[2022-02-16 14:47:45,422] {{bash.py:169}} INFO - Output:
[2022-02-16 14:47:45,423] {{bash.py:173}} INFO - bash: spark-submit: command not found
[2022-02-16 14:47:45,423] {{bash.py:177}} INFO - Command exited with return code 127
[2022-02-16 14:47:45,437] {{taskinstance.py:1482}} ERROR - Task failed with exception
What has to be done? Here is my code:
def run_spark(**kwargs):
    import pyspark
    sc = pyspark.SparkContext()
    df = sc.textFile('s3://demoairflowpawan/people.txt')
    logging.info('Number of lines in people.txt = {0}'.format(df.count()))
    sc.stop()
spark_task = BashOperator(
    task_id='spark_java',
    bash_command='spark-submit --class {{ params.class }} {{ params.jar }}',
    params={'class': 'org.xyz.practice.driver.PractitionerDriver', 'jar': 's3://pfdt-poc-temp/xyz_test/org.xyz.spark-xy_mvp-1.0.0-SNAPSHOT.jar'},
    dag=dag
)
The question is: why do you expect spark-submit to be there?
If you created the default Airflow pods, they come with the Airflow code only.
You can find an example for Spark and Airflow here: https://medium.com/codex/executing-spark-jobs-with-apache-airflow-3596717bbbe3. It states specifically that "Spark binaries must be added and mapped".
So you need to figure out how to get the Spark binaries onto the existing Airflow pod.
Alternatively, you can create another k8s job that does the spark-submit, and have your DAG trigger this job.
Sorry for the high-level answer.
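To confirm from inside a task whether the binary is available at all, a small check with the standard library can help. This is a minimal sketch; the helper names find_executable and describe are made up, and the function would have to be called from a task (for example a PythonOperator callable) to inspect the worker's PATH rather than your local one.

```python
import shutil


def find_executable(name):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


def describe(name):
    """Return a one-line, log-friendly description of where `name` was found."""
    path = find_executable(name)
    if path is None:
        return "{} not found on PATH".format(name)
    return "{} found at {}".format(name, path)


print(describe("spark-submit"))
```

If this reports that spark-submit is not found on the worker, the BashOperator command cannot work until Spark is installed there or the job is submitted some other way.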

Why are tasks stuck in None state in Airflow 1.10.2 after a trigger_dag

I have a dummy DAG that I want to start episodically by setting its start_date to today and leaving its schedule interval at daily.
Here is the DAG code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# -*- airflow: DAG -*-
import logging
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime
logger = logging.getLogger("DummyDAG")
def execute_python_function():
    logging.info("HEY YO !!!")
    return True
dag = DAG(dag_id='dummy_dag',
          start_date=datetime.today())
start = DummyOperator(task_id='start', dag=dag)
end = DummyOperator(task_id='end', dag=dag)
py_operator = PythonOperator(task_id='exec_function',
                             python_callable=execute_python_function,
                             dag=dag)
start >> py_operator >> end
In Airflow 1.9.0, when I run airflow trigger_dag -e 20190701, the DAG Run is created and the Task Instances are created, scheduled and executed.
However, in Airflow 1.10.2 the DAG Run and the Task Instances are created, but the Task Instances are stuck in None state.
In both versions, depends_on_past is False.
Here are the details of the start task in Airflow 1.9.0 (it is executed successfully after some time):
Task Instance Details
Dependencies Blocking Task From Getting Scheduled
Dependency: Reason
Dagrun Running: Task instance's dagrun was not in the 'running' state but in the state 'success'.
Task Instance State: Task is in the 'success' state which is not a valid state for execution. The task must be cleared in order to be run.
Execution Date: The execution date is 2019-07-10T00:00:00 but this is before the task's start date 2019-07-11T08:45:18.230876.
Execution Date: The execution date is 2019-07-10T00:00:00 but this is before the task's DAG's start date 2019-07-11T08:45:18.230876.
Task instance attribute
Attribute Value
dag_id dummy_dag
duration None
end_date 2019-07-10 16:32:10.372976
execution_date 2019-07-10 00:00:00
generate_command <function generate_command at 0x7fc9fcc85b90>
hostname airflow-worker-5dc5b999b6-2l5cp
is_premature False
job_id None
key ('dummy_dag', 'start', datetime.datetime(2019, 7, 10, 0, 0))
log <logging.Logger object at 0x7fca014e7f10>
log_filepath /home/airflow/gcs/logs/dummy_dag/start/2019-07-10T00:00:00.log
log_url https://i39907f7014685e91-tp.appspot.com/admin/airflow/log?dag_id=dummy_dag&task_id=start&execution_date=2019-07-10T00:00:00
logger <logging.Logger object at 0x7fca014e7f10>
mark_success_url https://i39907f7014685e91-tp.appspot.com/admin/airflow/success?task_id=start&dag_id=dummy_dag&execution_date=2019-07-10T00:00:00&upstream=false&downstream=false
max_tries 0
metadata MetaData(bind=None)
next_try_number 2
operator None
pid 180712
pool None
previous_ti None
priority_weight 3
queue default
queued_dttm None
run_as_user None
start_date 2019-07-10 16:32:08.483531
state success
task <Task(DummyOperator): start>
task_id start
test_mode False
try_number 2
unixname airflow
Task Attributes
Attribute Value
adhoc False
dag <DAG: dummy_dag>
dag_id dummy_dag
depends_on_past False
deps set([<TIDep(Not In Retry Period)>, <TIDep(Previous Dagrun State)>, <TIDep(Trigger Rule)>])
downstream_list [<Task(PythonOperator): exec_function>]
downstream_task_ids ['exec_function']
email None
email_on_failure True
email_on_retry True
end_date None
execution_timeout None
log <logging.Logger object at 0x7fc9e2085350>
logger <logging.Logger object at 0x7fc9e2085350>
max_retry_delay None
on_failure_callback None
on_retry_callback None
on_success_callback None
owner Airflow
params {}
pool None
priority_weight 1
priority_weight_total 3
queue default
resources {'disk': {'_qty': 512, '_units_str': 'MB', '_name': 'Disk'}, 'gpus': {'_qty': 0, '_units_str': 'gpu(s)', '_name': 'GPU'}, 'ram': {'_qty': 512, '_units_str': 'MB', '_name': 'RAM'}, 'cpus': {'_qty': 1, '_units_str': 'core(s)', '_name': 'CPU'}}
retries 0
retry_delay 0:05:00
retry_exponential_backoff False
run_as_user None
schedule_interval 1 day, 0:00:00
sla None
start_date 2019-07-11 08:45:18.230876
task_concurrency None
task_id start
task_type DummyOperator
template_ext []
template_fields ()
trigger_rule all_success
ui_color #e8f7e4
ui_fgcolor #000
upstream_list []
upstream_task_ids []
wait_for_downstream False
Details of the start task in Airflow 1.10.2
Task Instance Details
Dependencies Blocking Task From Getting Scheduled
Dependency Reason
Execution Date The execution date is 2019-07-11T00:00:00+00:00 but this is before the task's start date 2019-07-11T08:53:32.593360+00:00.
Execution Date The execution date is 2019-07-11T00:00:00+00:00 but this is before the task's DAG's start date 2019-07-11T08:53:32.593360+00:00.
Task Instance Attributes
Attribute Value
dag_id dummy_dag
duration None
end_date None
execution_date 2019-07-11T00:00:00+00:00
executor_config {}
generate_command <function generate_command at 0x7f4621301578>
hostname
is_premature False
job_id None
key ('dummy_dag', 'start', <Pendulum [2019-07-11T00:00:00+00:00]>, 1)
log <logging.Logger object at 0x7f4624883350>
log_filepath /home/airflow/gcs/logs/dummy_dag/start/2019-07-11T00:00:00+00:00.log
log_url https://a15d189066a5c65ee-tp.appspot.com/admin/airflow/log?dag_id=dummy_dag&task_id=start&execution_date=2019-07-11T00%3A00%3A00%2B00%3A00
logger <logging.Logger object at 0x7f4624883350>
mark_success_url https://a15d189066a5c65ee-tp.appspot.com/admin/airflow/success?task_id=start&dag_id=dummy_dag&execution_date=2019-07-11T00%3A00%3A00%2B00%3A00&upstream=false&downstream=false
max_tries 0
metadata MetaData(bind=None)
next_try_number 1
operator None
pid None
pool None
previous_ti None
priority_weight 3
queue default
queued_dttm None
raw False
run_as_user None
start_date None
state None
task <Task(DummyOperator): start>
task_id start
test_mode False
try_number 1
unixname airflow
Task Attributes
Attribute Value
adhoc False
dag <DAG: dummy_dag>
dag_id dummy_dag
depends_on_past False
deps set([<TIDep(Previous Dagrun State)>, <TIDep(Trigger Rule)>, <TIDep(Not In Retry Period)>])
downstream_list [<Task(PythonOperator): exec_function>]
downstream_task_ids set(['exec_function'])
email None
email_on_failure True
email_on_retry True
end_date None
execution_timeout None
executor_config {}
inlets []
lineage_data None
log <logging.Logger object at 0x7f460b467e10>
logger <logging.Logger object at 0x7f460b467e10>
max_retry_delay None
on_failure_callback None
on_retry_callback None
on_success_callback None
outlets []
owner Airflow
params {}
pool None
priority_weight 1
priority_weight_total 3
queue default
resources {'disk': {'_qty': 512, '_units_str': 'MB', '_name': 'Disk'}, 'gpus': {'_qty': 0, '_units_str': 'gpu(s)', '_name': 'GPU'}, 'ram': {'_qty': 512, '_units_str': 'MB', '_name': 'RAM'}, 'cpus': {'_qty': 1, '_units_str': 'core(s)', '_name': 'CPU'}}
retries 0
retry_delay 0:05:00
retry_exponential_backoff False
run_as_user None
schedule_interval 1 day, 0:00:00
sla None
start_date 2019-07-11T08:53:32.593360+00:00
task_concurrency None
task_id start
task_type DummyOperator
template_ext []
template_fields ()
trigger_rule all_success
ui_color #e8f7e4
ui_fgcolor #000
upstream_list []
upstream_task_ids set([])
wait_for_downstream False
weight_rule downstream
IMO it's not a problem with the version. If you check the task details, you will see messages like:
Execution Date:
The execution date is 2019-07-10T00:00:00 but this is before the task's start date 2019-07-11T08:45:18.230876.
The execution date is the one you put in trigger_dag command whereas the start date of your DAG is changing because Python's datetime.today() returns the current time. To see that, you can do:
airflow@e3bc9a0a7a3e:~$ airflow trigger_dag dummy_dag -e 20190702
And later go to http://localhost:8080/admin/airflow/task?dag_id=dummy_dag&task_id=start&execution_date=2019-07-02T00%3A00%3A00%2B00%3A00 (or any corresponding URL) and refresh the page. You should see Dependency > Execution date changing every time.
In your case this is a problem because you're trying to trigger a DAG run in the past. A better way is to specify a static date or use one of Airflow's util methods to compute it:
dag = DAG(dag_id='dummy_dag',
          start_date=datetime(2019, 7, 11, 0, 0))
Otherwise, if you want to reprocess historical data, you can use airflow backfill.
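The "Execution Date" dependency quoted above boils down to a date comparison. The sketch below is a simplified model of that check, not Airflow's code. The dates are taken from the task details in the question, except static_start, which is a hypothetical fixed start date chosen for illustration.

```python
from datetime import datetime


def blocked_by_start_date(execution_date, start_date):
    """Simplified model: a run is blocked when it is dated before the start date."""
    return execution_date < start_date


# Execution date passed to trigger_dag with -e.
execution_date = datetime(2019, 7, 10)
# What datetime.today() evaluated to when the DAG file was parsed.
moving_start = datetime(2019, 7, 11, 8, 45, 18)
# Hypothetical static start date that lies before the execution date.
static_start = datetime(2019, 7, 1)

print(blocked_by_start_date(execution_date, moving_start))  # True
print(blocked_by_start_date(execution_date, static_start))  # False
```

Because the DAG file is re-parsed regularly, a start_date of datetime.today() keeps moving forward, so any execution date you pass on the command line ends up before it.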
Update
Running DAGs on demand
After clarification in the comments, we found another way to trigger a DAG on demand: set the property schedule_interval=None.
If it's a subdag that you are trying to unpause, you need to go to the Logs tab. There, at the top-right corner, you will have the Pause/Unpause button.

2 RabbitMQ workers and 2 Scrapyd daemons running on 2 local Ubuntu instances, in which one of the RabbitMQ workers is not working

I am currently building a "Scrapy spiders control panel", for which I am testing the existing solution Distributed Multi-user Scrapy Spiders Control Panel: https://github.com/aaldaber/Distributed-Multi-User-Scrapy-System-with-a-Web-UI.
I am trying to run this on my local Ubuntu dev machine but am having issues with the scrapyd daemon.
One of the workers, linkgenerator, is working, but the scraper, worker1, is not.
I cannot figure out why scrapyd won't run on the other local instance.
Background information about the configuration:
The application comes bundled with Django, Scrapy, a pipeline for MongoDB (for saving the scraped items) and a Scrapy scheduler for RabbitMQ (for distributing the links among workers). I have 2 local Ubuntu instances. Django, MongoDB, a Scrapyd daemon and the RabbitMQ server are running on Instance1.
Another Scrapyd daemon is running on Instance2.
RabbitMQ Workers:
linkgenerator
worker1
IP Configurations for Instances:
IP For local Ubuntu Instance1: 192.168.0.101
IP for local Ubuntu Instance2: 192.168.0.106
List of tools used:
MongoDB server
RabbitMQ server
Scrapy Scrapyd API
One RabbitMQ linkgenerator worker (WorkerName: linkgenerator) server with Scrapy installed and running scrapyd daemon on local Ubuntu Instance1: 192.168.0.101
Another one RabbitMQ scraper worker (WorkerName: worker1) server with Scrapy installed and running scrapyd daemon on local Ubuntu Instance2: 192.168.0.106
Instance1: 192.168.0.101
"Instance1" on which Django, RabbitMQ, scrapyd daemon servers running -- IP : 192.168.0.101
Instance2: 192.168.0.106
Scrapy installed on instance2 and running scrapyd daemon
Scrapy Control Panel UI Snapshot:
The snapshot shows the control panel. There are two workers: linkgenerator worked successfully but worker1 did not. The logs are given at the end of the post.
RabbitMQ status info
The linkgenerator worker can successfully push messages to the RabbitMQ queue. The linkgenerator spider generates start_urls for the scraper spider; these are meant to be consumed by the scraper (worker1), which is not working. Please see the logs for worker1 at the end of the post.
RabbitMQ settings
The below file contains the settings for MongoDB and RabbitMQ:
SCHEDULER = ".rabbitmq.scheduler.Scheduler"
SCHEDULER_PERSIST = True
RABBITMQ_HOST = 'ScrapyDevU79'
RABBITMQ_PORT = 5672
RABBITMQ_USERNAME = 'guest'
RABBITMQ_PASSWORD = 'guest'
MONGODB_PUBLIC_ADDRESS = 'OneScience:27017' # This will be shown on the web interface, but won't be used for connecting to DB
MONGODB_URI = 'localhost:27017' # Actual uri to connect to DB
MONGODB_USER = 'tariq'
MONGODB_PASSWORD = 'toor'
MONGODB_SHARDED = True
MONGODB_BUFFER_DATA = 100
# Set your link generator worker address here
LINK_GENERATOR = 'http://192.168.0.101:6800'
SCRAPERS = ['http://192.168.0.106:6800']
LINUX_USER_CREATION_ENABLED = False # Set this to True if you want a linux user account
linkgenerator scrapy.cfg settings:
[settings]
default = tester2_fda_trial20.settings
[deploy:linkgenerator]
url = http://192.168.0.101:6800
project = tester2_fda_trial20
scraper scrapy.cfg settings:
[settings]
default = tester2_fda_trial20.settings
[deploy:worker1]
url = http://192.168.0.101:6800
project = tester2_fda_trial20
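One detail that may or may not be intentional: the `[deploy:worker1]` target above points at 192.168.0.101, while `SCRAPERS` in the settings lists 192.168.0.106. A small sketch (Python 3, standard library only) for spotting such a difference is shown here; `deploy_targets` and `unexpected_targets` are hypothetical helper names, and the config text is copied from this post.

```python
import configparser

# Contents of the scraper's scrapy.cfg, as shown above.
SCRAPER_CFG = """\
[settings]
default = tester2_fda_trial20.settings

[deploy:worker1]
url = http://192.168.0.101:6800
project = tester2_fda_trial20
"""

# Value of SCRAPERS from the project settings.
SCRAPERS = ["http://192.168.0.106:6800"]


def deploy_targets(cfg_text):
    """Map each [deploy:<name>] section to its url."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    targets = {}
    for section in parser.sections():
        if section.startswith("deploy:"):
            targets[section.split(":", 1)[1]] = parser.get(section, "url")
    return targets


def unexpected_targets(cfg_text, expected_urls):
    """Return the deploy targets whose url is not among expected_urls."""
    return {
        name: url
        for name, url in deploy_targets(cfg_text).items()
        if url.rstrip("/") not in expected_urls
    }


print(unexpected_targets(SCRAPER_CFG, SCRAPERS))
# {'worker1': 'http://192.168.0.101:6800'}
```

If the worker1 project is supposed to be deployed to the scrapyd daemon on Instance2, the `url` in `[deploy:worker1]` would need to match the address in `SCRAPERS`; if the web UI handles deployment differently, this difference may be harmless.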
scrapyd.conf file settings for Instance1 (192.168.0.101)
cat /etc/scrapyd/scrapyd.conf
[scrapyd]
eggs_dir = /var/lib/scrapyd/eggs
dbs_dir = /var/lib/scrapyd/dbs
items_dir = /var/lib/scrapyd/items
logs_dir = /var/log/scrapyd
max_proc = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
#bind_address = 127.0.0.1
http_port = 6800
debug = on
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
webroot = scrapyd.website.Root
[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
scrapyd.conf file settings for Instance2 (192.168.0.106)
cat /etc/scrapyd/scrapyd.conf
[scrapyd]
eggs_dir = /var/lib/scrapyd/eggs
dbs_dir = /var/lib/scrapyd/dbs
items_dir = /var/lib/scrapyd/items
logs_dir = /var/log/scrapyd
max_proc = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
#bind_address = 127.0.0.1
http_port = 6800
debug = on
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
webroot = scrapyd.website.Root
[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
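Both daemons expose the same JSON services, so each one can be queried directly to confirm it answers and knows about the project. This is a rough sketch in Python 3 using only the standard library; `scrapyd_url` and `scrapyd_get` are hypothetical helper names, while the endpoint names come from the `[services]` sections above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def scrapyd_url(base, endpoint, **params):
    """Build the URL of a scrapyd JSON service, e.g. listjobs.json."""
    url = "%s/%s" % (base.rstrip("/"), endpoint)
    if params:
        url += "?" + urlencode(sorted(params.items()))
    return url


def scrapyd_get(base, endpoint, timeout=5, **params):
    """GET a scrapyd JSON service and return the decoded response."""
    with urlopen(scrapyd_url(base, endpoint, **params), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example calls, to be run from a machine that can reach both instances:
#   scrapyd_get("http://192.168.0.101:6800", "daemonstatus.json")
#   scrapyd_get("http://192.168.0.106:6800", "listjobs.json",
#               project="tester2_fda_trial20")
```

Running the same calls from Instance1 and from Instance2 would show whether either daemon is unreachable from the other machine.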
RabbitMQ Status
sudo service rabbitmq-server status
[sudo] password for mtaziz:
Status of node rabbit@ScrapyDevU79
[{pid,53715},
{running_applications,
[{rabbitmq_shovel_management,
"Management extension for the Shovel plugin","3.6.11"},
{rabbitmq_shovel,"Data Shovel for RabbitMQ","3.6.11"},
{rabbitmq_management,"RabbitMQ Management Console","3.6.11"},
{rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.6.11"},
{rabbitmq_management_agent,"RabbitMQ Management Agent","3.6.11"},
{rabbit,"RabbitMQ","3.6.11"},
{os_mon,"CPO CXC 138 46","2.2.14"},
{cowboy,"Small, fast, modular HTTP server.","1.0.4"},
{ranch,"Socket acceptor pool for TCP protocols.","1.3.0"},
{ssl,"Erlang/OTP SSL application","5.3.2"},
{public_key,"Public key infrastructure","0.21"},
{cowlib,"Support library for manipulating Web protocols.","1.0.2"},
{crypto,"CRYPTO version 2","3.2"},
{amqp_client,"RabbitMQ AMQP Client","3.6.11"},
{rabbit_common,
"Modules shared by rabbitmq-server and rabbitmq-erlang-client",
"3.6.11"},
{inets,"INETS CXC 138 49","5.9.7"},
{mnesia,"MNESIA CXC 138 12","4.11"},
{compiler,"ERTS CXC 138 10","4.9.4"},
{xmerl,"XML parser","1.3.5"},
{syntax_tools,"Syntax tools","1.6.12"},
{asn1,"The Erlang ASN1 compiler version 2.0.4","2.0.4"},
{sasl,"SASL CXC 138 11","2.3.4"},
{stdlib,"ERTS CXC 138 10","1.19.4"},
{kernel,"ERTS CXC 138 10","2.16.4"}]},
{os,{unix,linux}},
{erlang_version,
"Erlang R16B03 (erts-5.10.4) [source] [64-bit] [smp:4:4] [async-threads:64] [kernel-poll:true]\n"},
{memory,
[{connection_readers,0},
{connection_writers,0},
{connection_channels,0},
{connection_other,6856},
{queue_procs,145160},
{queue_slave_procs,0},
{plugins,1959248},
{other_proc,22328920},
{metrics,160112},
{mgmt_db,655320},
{mnesia,83952},
{other_ets,2355800},
{binary,96920},
{msg_index,47352},
{code,27101161},
{atom,992409},
{other_system,31074022},
{total,87007232}]},
{alarms,[]},
{listeners,[{clustering,25672,"::"},{amqp,5672,"::"},{http,15672,"::"}]},
{vm_memory_calculation_strategy,rss},
{vm_memory_high_watermark,0.4},
{vm_memory_limit,3343646720},
{disk_free_limit,50000000},
{disk_free,56257699840},
{file_descriptors,
[{total_limit,924},{total_used,2},{sockets_limit,829},{sockets_used,0}]},
{processes,[{limit,1048576},{used,351}]},
{run_queue,0},
{uptime,34537},
{kernel,{net_ticktime,60}}]
scrapyd daemon on Instance1 ( 192.168.0.101 ) running status:
scrapyd
2017-09-11T06:16:07+0600 [-] Loading /home/mtaziz/.virtualenvs/onescience_dist_env/local/lib/python2.7/site-packages/scrapyd/txapp.py...
2017-09-11T06:16:07+0600 [-] Scrapyd web console available at http://0.0.0.0:6800/
2017-09-11T06:16:07+0600 [-] Loaded.
2017-09-11T06:16:07+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 17.5.0 (/home/mtaziz/.virtualenvs/onescience_dist_env/bin/python 2.7.6) starting up.
2017-09-11T06:16:07+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
2017-09-11T06:16:07+0600 [-] Site starting on 6800
2017-09-11T06:16:07+0600 [twisted.web.server.Site#info] Starting factory <twisted.web.server.Site instance at 0x7f5e265c77a0>
2017-09-11T06:16:07+0600 [Launcher] Scrapyd 1.2.0 started: max_proc=16, runner='scrapyd.runner'
2017-09-11T06:16:07+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:16:07 +0000] "GET /listprojects.json HTTP/1.1" 200 98 "-" "python-requests/2.18.4"
2017-09-11T06:16:07+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:16:07 +0000] "GET /listversions.json?project=tester2_fda_trial20 HTTP/1.1" 200 80 "-" "python-requests/2.18.4"
2017-09-11T06:16:07+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:16:07 +0000] "GET /listjobs.json?project=tester2_fda_trial20 HTTP/1.1" 200 92 "-" "python-requests/2.18.4"
scrapyd daemon on instance2 (192.168.0.106) running status:
scrapyd
2017-09-11T06:09:28+0600 [-] Loading /home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapyd/txapp.py...
2017-09-11T06:09:28+0600 [-] Scrapyd web console available at http://0.0.0.0:6800/
2017-09-11T06:09:28+0600 [-] Loaded.
2017-09-11T06:09:28+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 17.5.0 (/home/mtaziz/.virtualenvs/scrapydevenv/bin/python 2.7.6) starting up.
2017-09-11T06:09:28+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
2017-09-11T06:09:28+0600 [-] Site starting on 6800
2017-09-11T06:09:28+0600 [twisted.web.server.Site#info] Starting factory <twisted.web.server.Site instance at 0x7fbe6eaeac20>
2017-09-11T06:09:28+0600 [Launcher] Scrapyd 1.2.0 started: max_proc=16, runner='scrapyd.runner'
2017-09-11T06:09:32+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:32 +0000] "GET /listprojects.json HTTP/1.1" 200 98 "-" "python-requests/2.18.4"
2017-09-11T06:09:32+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:32 +0000] "GET /listversions.json?project=tester2_fda_trial20 HTTP/1.1" 200 80 "-" "python-requests/2.18.4"
2017-09-11T06:09:32+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:32 +0000] "GET /listjobs.json?project=tester2_fda_trial20 HTTP/1.1" 200 92 "-" "python-requests/2.18.4"
2017-09-11T06:09:37+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:37 +0000] "GET /listprojects.json HTTP/1.1" 200 98 "-" "python-requests/2.18.4"
2017-09-11T06:09:37+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:37 +0000] "GET /listversions.json?project=tester2_fda_trial20 HTTP/1.1" 200 80 "-" "python-requests/2.18.4"
worker1 logs
These logs were captured after I updated the RabbitMQ server settings as suggested by @Tarun Lalwani.
The suggestion was to use the RabbitMQ server IP, 192.168.0.101:5672, instead of 127.0.0.1:5672.
After making that change, I got the new problems shown below:
2017-09-11 15:49:18 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: tester2_fda_trial20)
2017-09-11 15:49:18 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tester2_fda_trial20.spiders', 'ROBOTSTXT_OBEY': True, 'LOG_LEVEL': 'INFO', 'SPIDER_MODULES': ['tester2_fda_trial20.spiders'], 'BOT_NAME': 'tester2_fda_trial20', 'FEED_URI': 'file:///var/lib/scrapyd/items/tester2_fda_trial20/tester2_fda_trial20/79b1123a96d611e79276000c29bad697.jl', 'SCHEDULER': 'tester2_fda_trial20.rabbitmq.scheduler.Scheduler', 'TELNETCONSOLE_ENABLED': False, 'LOG_FILE': '/var/log/scrapyd/tester2_fda_trial20/tester2_fda_trial20/79b1123a96d611e79276000c29bad697.log'}
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.corestats.CoreStats']
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled item pipelines:
['tester2_fda_trial20.pipelines.FdaTrial20Pipeline',
'tester2_fda_trial20.mongodb.scrapy_mongodb.MongoDBPipeline']
2017-09-11 15:49:18 [scrapy.core.engine] INFO: Spider opened
2017-09-11 15:49:18 [pika.adapters.base_connection] INFO: Connecting to 192.168.0.101:5672
2017-09-11 15:49:18 [pika.adapters.blocking_connection] INFO: Created channel=1
2017-09-11 15:49:18 [scrapy.core.engine] INFO: Closing spider (shutdown)
2017-09-11 15:49:18 [pika.adapters.blocking_connection] INFO: Channel.close(0, Normal Shutdown)
2017-09-11 15:49:18 [pika.channel] INFO: Channel.close(0, Normal Shutdown)
2017-09-11 15:49:18 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method ?.close_spider of <scrapy.extensions.feedexport.FeedExporter object at 0x7f94878b8c50>>
Traceback (most recent call last):
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
result = f(*args, **kw)
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
return receiver(*arguments, **named)
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapy/extensions/feedexport.py", line 201, in close_spider
slot = self.slot
AttributeError: 'FeedExporter' object has no attribute 'slot'
2017-09-11 15:49:18 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method ?.spider_closed of <Tester2Fda_Trial20Spider 'tester2_fda_trial20' at 0x7f9484f897d0>>
Traceback (most recent call last):
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
result = f(*args, **kw)
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
return receiver(*arguments, **named)
File "/tmp/user/1000/tester2_fda_trial20-10-d4Req9.egg/tester2_fda_trial20/spiders/tester2_fda_trial20.py", line 28, in spider_closed
AttributeError: 'Tester2Fda_Trial20Spider' object has no attribute 'statstask'
2017-09-11 15:49:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'shutdown',
'finish_time': datetime.datetime(2017, 9, 11, 9, 49, 18, 159896),
'log_count/ERROR': 2,
'log_count/INFO': 10}
2017-09-11 15:49:18 [scrapy.core.engine] INFO: Spider closed (shutdown)
2017-09-11 15:49:18 [twisted] CRITICAL: Unhandled error in Deferred:
2017-09-11 15:49:18 [twisted] CRITICAL:
Traceback (most recent call last):
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
result = g.send(result)
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapy/crawler.py", line 95, in crawl
six.reraise(*exc_info)
File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapy/crawler.py", line 79, in crawl
yield self.engine.open_spider(self.spider, start_requests)
OperationFailure: command SON([('saslStart', 1), ('mechanism', 'SCRAM-SHA-1'), ('payload', Binary('n,,n=tariq,r=MjY5OTQ0OTYwMjA4', 0)), ('autoAuthorize', 1)]) on namespace admin.$cmd failed: Authentication failed.
MongoDBPipeline
# coding:utf-8
import datetime

from pymongo import errors
from pymongo.mongo_client import MongoClient
from pymongo.mongo_replica_set_client import MongoReplicaSetClient
from pymongo.read_preferences import ReadPreference
from scrapy.exporters import BaseItemExporter

try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2


def not_set(string):
    """ Check if a string is None or ''
    :returns: bool - True if the string is empty
    """
    if string is None:
        return True
    elif string == '':
        return True
    return False
class MongoDBPipeline(BaseItemExporter):
    """ MongoDB pipeline class """
    # Default options
    config = {
        'uri': 'mongodb://localhost:27017',
        'fsync': False,
        'write_concern': 0,
        'database': 'scrapy-mongodb',
        'collection': 'items',
        'replica_set': None,
        'buffer': None,
        'append_timestamp': False,
        'sharded': False
    }
    # Needed for sending acknowledgement signals to RabbitMQ for all persisted items
    queue = None
    acked_signals = []
    # Item buffer
    item_buffer = dict()

    def load_spider(self, spider):
        self.crawler = spider.crawler
        self.settings = spider.settings
        self.queue = self.crawler.engine.slot.scheduler.queue
    def open_spider(self, spider):
        self.load_spider(spider)
        # Configure the connection
        self.configure()
        self.spidername = spider.name
        # Credentials are separated from the host by '@'; authentication is done against the admin database
        self.config['uri'] = 'mongodb://' + self.config['username'] + ':' + quote(self.config['password']) + '@' + self.config['uri'] + '/admin'
        self.shardedcolls = []
        if self.config['replica_set'] is not None:
            self.connection = MongoReplicaSetClient(
                self.config['uri'],
                replicaSet=self.config['replica_set'],
                w=self.config['write_concern'],
                fsync=self.config['fsync'],
                read_preference=ReadPreference.PRIMARY_PREFERRED)
        else:
            # Connecting to a stand alone MongoDB
            self.connection = MongoClient(
                self.config['uri'],
                fsync=self.config['fsync'],
                read_preference=ReadPreference.PRIMARY)
        # Set up the collection
        self.database = self.connection[spider.name]
        # Autoshard the DB
        if self.config['sharded']:
            db_statuses = self.connection['config']['databases'].find({})
            partitioned = []
            notpartitioned = []
            for status in db_statuses:
                if status['partitioned']:
                    partitioned.append(status['_id'])
                else:
                    notpartitioned.append(status['_id'])
            if spider.name in notpartitioned or spider.name not in partitioned:
                try:
                    self.connection.admin.command('enableSharding', spider.name)
                except errors.OperationFailure:
                    pass
            else:
                collections = self.connection['config']['collections'].find({})
                for coll in collections:
                    if (spider.name + '.') in coll['_id']:
                        if coll['dropped'] is not True:
                            if coll['_id'].index(spider.name + '.') == 0:
                                self.shardedcolls.append(coll['_id'][coll['_id'].index('.') + 1:])
    def configure(self):
        """ Configure the MongoDB connection """
        # Set all regular options
        options = [
            ('uri', 'MONGODB_URI'),
            ('fsync', 'MONGODB_FSYNC'),
            ('write_concern', 'MONGODB_REPLICA_SET_W'),
            ('database', 'MONGODB_DATABASE'),
            ('collection', 'MONGODB_COLLECTION'),
            ('replica_set', 'MONGODB_REPLICA_SET'),
            ('buffer', 'MONGODB_BUFFER_DATA'),
            ('append_timestamp', 'MONGODB_ADD_TIMESTAMP'),
            ('sharded', 'MONGODB_SHARDED'),
            ('username', 'MONGODB_USER'),
            ('password', 'MONGODB_PASSWORD')
        ]
        for key, setting in options:
            if not not_set(self.settings[setting]):
                self.config[key] = self.settings[setting]
    def process_item(self, item, spider):
        """ Process the item and add it to MongoDB
        :type item: Item object
        :param item: The item to put into MongoDB
        :type spider: BaseSpider object
        :param spider: The spider running the queries
        :returns: Item object
        """
        item_name = item.__class__.__name__
        # If we are working with a sharded DB, the collection will also be sharded
        if self.config['sharded']:
            if item_name not in self.shardedcolls:
                try:
                    self.connection.admin.command('shardCollection', '%s.%s' % (self.spidername, item_name), key={'_id': "hashed"})
                    self.shardedcolls.append(item_name)
                except errors.OperationFailure:
                    self.shardedcolls.append(item_name)
        itemtoinsert = dict(self._get_serialized_fields(item))
        if self.config['buffer']:
            if item_name not in self.item_buffer:
                self.item_buffer[item_name] = []
                self.item_buffer[item_name].append([])
                self.item_buffer[item_name].append(0)
            self.item_buffer[item_name][1] += 1
            if self.config['append_timestamp']:
                itemtoinsert['scrapy-mongodb'] = {'ts': datetime.datetime.utcnow()}
            self.item_buffer[item_name][0].append(itemtoinsert)
            if self.item_buffer[item_name][1] == self.config['buffer']:
                self.item_buffer[item_name][1] = 0
                self.insert_item(self.item_buffer[item_name][0], spider, item_name)
            return item
        self.insert_item(itemtoinsert, spider, item_name)
        return item
    def close_spider(self, spider):
        """ Method called when the spider is closed
        :type spider: BaseSpider object
        :param spider: The spider running the queries
        :returns: None
        """
        for key in self.item_buffer:
            if self.item_buffer[key][0]:
                self.insert_item(self.item_buffer[key][0], spider, key)
    def insert_item(self, item, spider, item_name):
        """ Process the item and add it to MongoDB
        :type item: (Item object) or [(Item object)]
        :param item: The item(s) to put into MongoDB
        :type spider: BaseSpider object
        :param spider: The spider running the queries
        :returns: Item object
        """
        self.collection = self.database[item_name]
        if not isinstance(item, list):
            if self.config['append_timestamp']:
                item['scrapy-mongodb'] = {'ts': datetime.datetime.utcnow()}
            ack_signal = item['ack_signal']
            item.pop('ack_signal', None)
            self.collection.insert(item, continue_on_error=True)
            if ack_signal not in self.acked_signals:
                self.queue.acknowledge(ack_signal)
                self.acked_signals.append(ack_signal)
        else:
            signals = []
            for eachitem in item:
                signals.append(eachitem['ack_signal'])
                eachitem.pop('ack_signal', None)
            self.collection.insert(item, continue_on_error=True)
            del item[:]
            for ack_signal in signals:
                if ack_signal not in self.acked_signals:
                    self.queue.acknowledge(ack_signal)
                    self.acked_signals.append(ack_signal)
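The connection string that open_spider assembles can be reproduced in isolation to see exactly what worker1 sends to MongoDB. This is a minimal sketch, assuming the separator between the credentials and the host is meant to be the standard `@`; `build_mongo_uri` is a hypothetical helper, and unlike the pipeline it percent-encodes both fields with `safe=""` so that a `/` in a password cannot break the URI.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2


def build_mongo_uri(username, password, host, auth_db="admin"):
    """Return mongodb://user:password@host/auth_db with escaped credentials."""
    return "mongodb://%s:%s@%s/%s" % (
        quote(username, safe=""),
        quote(password, safe=""),
        host,
        auth_db,
    )


# Values from MONGODB_USER, MONGODB_PASSWORD and MONGODB_URI in the settings.
print(build_mongo_uri("tariq", "toor", "localhost:27017"))
# mongodb://tariq:toor@localhost:27017/admin
```

Because MONGODB_URI is 'localhost:27017', the spider started by scrapyd on Instance2 would authenticate against a MongoDB on Instance2 itself. If the user tariq was only created in the admin database on Instance1, that could be one explanation for the "Authentication failed" error in the worker1 logs, though this is a guess that would need to be checked with the mongo shell on each machine.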
To sum up, I believe the problem lies with the scrapyd daemons: they are running on both instances, but somehow the scraper (worker1) cannot access them. I could not figure it out, and I did not find any similar use cases on Stack Overflow.
Any help is highly appreciated in this regard. Thank you in advance!