Airflow: How to unpause DAG from python script - python-2.7

I'm creating a python script that generates DAGs (it generates a new python file with the DAG specification) from templates. It all works fine, except that I need the DAG to be generated as unpaused.
I've searched and tried to run shell commands in the script like this:
import subprocess

# Build the Airflow CLI calls; DAG_ID is the id of the newly generated DAG.
bash_command1 = 'airflow list_dags'
bash_command2 = 'airflow trigger_dag ' + str(DAG_ID)
bash_command3 = 'airflow list_tasks ' + str(DAG_ID)
bash_command4 = 'airflow unpause ' + str(DAG_ID)
subprocess.call(bash_command1.split())
subprocess.call(bash_command2.split())
subprocess.call(bash_command3.split())
subprocess.call(bash_command4.split())
But every time I create a new DAG it is shown as paused in the web UI.
From the research I've done, the command airflow unpause <dag_id> should solve the problem, but when the script executes it, I get the error:
Traceback (most recent call last):
File "/home/cubo/anaconda2/bin/airflow", line 28, in <module>
args.func(args)
File "/home/cubo/anaconda2/lib/python2.7/site-packages/airflow/bin/cli.py", line 303, in unpause
set_is_paused(False, args, dag)
File "/home/cubo/anaconda2/lib/python2.7/site- packages/airflow/bin/cli.py", line 312, in set_is_paused
dm.is_paused = is_paused
AttributeError: 'NoneType' object has no attribute 'is_paused'
But when I execute the same airflow unpause <dag_id> command in the terminal it works fine, and it prints:
Dag: <DAG: DAG_ID>, paused: False
Any help would be greatly appreciated.

Airflow (1.8 and newer) pauses new DAGs by default. If you want all DAGs to be unpaused at creation, you can override the Airflow config to restore the prior behaviour.
The Airflow documentation on setting configuration options walks you through this; you want to set the core configuration option dags_are_paused_at_creation to False.
We use the environment variable approach on my team.
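With Airflow's standard AIRFLOW__<SECTION>__<KEY> convention that means exporting AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=False before the scheduler and webserver start. Alternatively, if only the newly generated DAG should start unpaused, the generating script can flip the DagModel row directly once the scheduler has parsed the new file; the AttributeError above simply means that row does not exist yet. A minimal sketch against the 1.8-era models (the same mechanism the unpause CLI command uses):

from airflow import settings
from airflow.models import DagModel

def unpause_dag(dag_id):
    # Returns True if the DAG row existed and was unpaused; False means the
    # scheduler has not parsed the generated file yet (the cause of the
    # AttributeError above), so retry after a short delay.
    session = settings.Session()
    try:
        dag_model = session.query(DagModel).filter(DagModel.dag_id == dag_id).first()
        if dag_model is None:
            return False
        dag_model.is_paused = False
        session.commit()
        return True
    finally:
        session.close()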

Related

Submit a pyspark job with a config file on Dataproc

I'm a newbie on GCP and I'm struggling with submitting a pyspark job on Dataproc.
I have a python script that depends on a config.yaml file, and I notice that when I submit the job everything is executed under /tmp/.
How can I make that config file available in the /tmp/ folder?
At the moment, I get this error:
12/22/2020 10:12:27 AM root INFO Read config file.
Traceback (most recent call last):
File "/tmp/job-test4/train.py", line 252, in <module>
run_training(args)
File "/tmp/job-test4/train.py", line 205, in run_training
with open(args.configfile, "r") as cf:
FileNotFoundError: [Errno 2] No such file or directory: 'gs://network-spark-migrate/model/demo-config.yml'
Thanks in advance
The snippet below worked for me:
gcloud dataproc jobs submit pyspark gs://network-spark-migrate/model/train.py --cluster train-spark-demo --region europe-west6 --files=gs://network-spark-migrate/model/demo-config.yml -- --configfile ./demo-config.yml
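The key point is that --files stages demo-config.yml into the job's working directory on the cluster, so the script must open it by its relative name rather than by the gs:// URI. The relevant part of train.py would look roughly like this (argument and function names are assumptions based on the traceback):

import argparse
import yaml  # assumes PyYAML is installed on the cluster

def run_training(args):
    # The file passed via --files is staged next to the job, so the relative
    # path ./demo-config.yml resolves locally; a gs:// URI cannot be open()ed.
    with open(args.configfile, "r") as cf:
        config = yaml.safe_load(cf)
    # ... the actual training logic would use `config` here ...

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--configfile", default="./demo-config.yml")
    run_training(parser.parse_args())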

"ImportError: No module named idlelib" when running Google Dataflow worker

I have a python 2.7 script I run locally to launch an Apache Beam / Google Dataflow job (SDK 2.12.0). The job takes a csv file from a Google Storage bucket, processes it and then creates an entity in Google Datastore for each row. The script ran fine for years... but now it is failing:
INFO:root:2019-05-15T22:07:11.481Z: JOB_MESSAGE_DETAILED: Workers have started successfully.
INFO:root:2019-05-15T21:47:13.370Z: JOB_MESSAGE_ERROR: Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 773, in run
self._load_main_session(self.local_staging_directory)
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 489, in _load_main_session
pickler.load_session(session_file)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 280, in load_session
return dill.load_session(file_path)
File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 410, in load_session
module = unpickler.load()
File "/usr/lib/python2.7/pickle.py", line 864, in load
dispatch[key](self)
File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
value = func(*args)
File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 827, in _import_module
return __import__(import_name)
ImportError: No module named idlelib
I believe this error is happening at the worker level (not locally). I don't make reference to idlelib in my script. To make sure it wasn't me, I have installed updates for all google-cloud packages, apache-beam[gcp] etc. locally, just in case. When I try importing idlelib into my script I get the same error. Any suggestions?
It had been fine for years and started failing with the SDK 2.12.0 release.
What was the last release that this script succeeded on? 2.11?
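No answer was accepted here, but since the failure happens in _load_main_session on the worker, one thing worth checking (an assumption, not a confirmed fix) is whether the pipeline really needs to pickle the main session; if it does not, disabling it means the worker never has to unpickle and re-import local-only modules such as idlelib:

from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions()  # the usual --project/--runner/--temp_location flags go here
# With save_main_session disabled, nothing from the launching interpreter's
# session is pickled, so the worker never reaches dill.load_session at all.
options.view_as(SetupOptions).save_main_session = False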

CeleryBeat suddenly skips scheduling few tasks while using Custom DatabaseScheduler

We are using
django-celery==3.1.10
celery==3.1.20
python 2.7.13
Broker - RabbitMQ
We have written a CustomDataBaseScheduler to schedule tasks; it reads entries from a MySQL table (a Django model) and schedules the tasks perfectly at the given time (a MySQL table column specifies the time). We run the CeleryBeat process via an init script.
Our scheduler model has approximately 3000 entries; most of them are scheduled every 5 minutes, some every 15 minutes, and a few hourly. But some of these tasks get skipped, i.e. they don't get scheduled on time. This behaviour is random and can happen to any of the tasks.
While digging through the beat logs, we found a MySQL exception:
Traceback (most recent call last):
File "/opt/DataMonster/datamonster/db_monster/scheduler.py", line 202, in schedule_changed
transaction.commit()
File "/opt/python-2.7.11/lib/python2.7/site-packages/django/db/transaction.py", line 154, in commit
get_connection(using).commit()
File "/opt/python-2.7.11/lib/python2.7/site-packages/django/db/backends/__init__.py", line 175, in commit
self.validate_no_atomic_block()
File "/opt/python-2.7.11/lib/python2.7/site- packages/django/db/backends/__init__.py", line 381, in validate_no_atomic_block
"This is forbidden when an 'atomic' block is active.")
I have checked the error across multiple sites and it appears to be related to the isolation level. We didn't find any other exception in the beat logs.
The isolation level used in MySQL is READ-UNCOMMITTED.
I need help digging into this issue.
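No answer was posted for this one, but the traceback itself shows schedule_changed() calling transaction.commit() while an atomic block is already open, which Django forbids. A hedged sketch of the pattern Django expects instead (the method name comes from the traceback; the helper inside it is hypothetical):

from django.db import transaction

def schedule_changed(self):
    # Let Django commit when the block exits instead of calling
    # transaction.commit() manually -- the manual commit is what raises
    # "This is forbidden when an 'atomic' block is active."
    with transaction.atomic():
        return self._schedule_has_new_entries()  # hypothetical helper for the existing check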

AssertionError: INTERNAL: No default project is specified

I'm new to airflow. I'm trying to run a sql query and store the result in a BigQuery table.
I'm getting the following error. I'm not sure where to set up the default_project_id.
Please help me.
Error:
Traceback (most recent call last):
File "/usr/local/bin/airflow", line 28, in <module>
args.func(args)
File "/usr/local/lib/python2.7/dist-packages/airflow/bin/cli.py", line 585, in test
ti.run(ignore_task_deps=True, ignore_ti_state=True, test_mode=True)
File "/usr/local/lib/python2.7/dist-packages/airflow/utils/db.py", line 53, in wrapper
result = func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 1374, in run
result = task_copy.execute(context=context)
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/operators/bigquery_operator.py", line 82, in execute
self.allow_large_results, self.udf_config, self.use_legacy_sql)
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/bigquery_hook.py", line 228, in run_query
default_project_id=self.project_id)
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/bigquery_hook.py", line 917, in _split_tablename
assert default_project_id is not None, "INTERNAL: No default project is specified"
AssertionError: INTERNAL: No default project is specified
Code:
sql_bigquery = BigQueryOperator(
    task_id='sql_bigquery',
    use_legacy_sql=False,
    write_disposition='WRITE_TRUNCATE',
    allow_large_results=True,
    bql='''
    #standardSQL
    SELECT ID, Name, Group, Mark, RATIO_TO_REPORT(Mark) OVER(PARTITION BY Group) AS percent FROM `tensile-site-168620.temp.marks`
    ''',
    destination_dataset_table='temp.percentage',
    dag=dag
)
EDIT: I finally fixed this problem by simply adding the bigquery_conn_id='bigquery' parameter in the BigQueryOperator task, after running the code below in a separate python script.
Apparently you need to specify your project ID in Admin -> Connection in the Airflow UI. You must do this as a JSON object such as "project" : "".
Personally I can't get the webserver working on GCP so this is unfeasible. There is a programmatic solution here:
from airflow.models import Connection
from airflow.settings import Session

session = Session()
gcp_conn = Connection(
    conn_id='bigquery',
    conn_type='google_cloud_platform',
    extra='{"extra__google_cloud_platform__project":"<YOUR PROJECT HERE>"}')
if not session.query(Connection).filter(
        Connection.conn_id == gcp_conn.conn_id).first():
    session.add(gcp_conn)
    session.commit()
These suggestions are from a similar question.
I get the same error when running airflow locally. My solution is to add the following connection string as an environment variable:
AIRFLOW_CONN_BIGQUERY_DEFAULT="google-cloud-platform://?extra__google_cloud_platform__project=<YOUR PROJECT HERE>"
BigQueryOperator uses the "bigquery_default" connection. When it is not specified, local airflow falls back to an internal version of the connection that is missing the project_id property. As you can see, the connection string above provides the project_id property.
On startup Airflow loads environment variables that start with "AIRFLOW_" into memory. This mechanism can be used to override airflow properties and to provide connections when running locally, as explained in the airflow documentation. Note that this also works when running airflow directly without starting the web server.
So I have set up environment variables for all my connections, for example AIRFLOW_CONN_MYSQL_DEFAULT. I have put them into a .env file that gets sourced by my IDE, but putting them into your .bash_profile would work fine too.
When you look at your airflow instance on Cloud Composer, you see that the "bigquery_default" connection there has the project_id property set. That's why BigQueryOperator works when running through Cloud Composer.
(I am on airflow 1.10.2 and BigQuery 1.10.2)
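If you would rather keep this in Python for local runs and tests, setting the same variable through os.environ before any connection lookup happens should behave the same way; this is just a sketch that mirrors the connection string above:

import os

# Airflow resolves AIRFLOW_CONN_<CONN_ID> environment variables when a hook
# looks up a connection, so this must be set before the task executes.
os.environ["AIRFLOW_CONN_BIGQUERY_DEFAULT"] = (
    "google-cloud-platform://?extra__google_cloud_platform__project=<YOUR PROJECT HERE>"
)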

Python pikascript.py fails from command prompt

I have a script in python which is used to connect to a RabbitMQ server and consume messages. When I try to run the script from the command prompt as "./pikascript.py" I get the proper output, but when I try to execute the same script as "python pikascript.py" I get the following error:
WARNING:pika.adapters.base_connection:Connection to 16.125.72.210:5671 failed: [Errno 1] _ssl.c:503: error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version
Traceback (most recent call last):
File "pikascript.py", line 39, in <module>
ssl=True, ssl_options=ssl_options))
File "build\bdist.win-amd64\egg\pika\adapters\blocking_connection.py", line 130, in __init__
File "build\bdist.win-amd64\egg\pika\adapters\base_connection.py", line 72, in __init__
File "build\bdist.win-amd64\egg\pika\connection.py", line 600, in __init__
File "build\bdist.win-amd64\egg\pika\adapters\blocking_connection.py", line 230, in connect
File "build\bdist.win-amd64\egg\pika\adapters\blocking_connection.py", line 301, in _adapter_connect
pika.exceptions.AMQPConnectionError: Connection to 16.125.72.210:5671 failed: [Errno 1] _ssl.c:503: error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version
I gave the proper path in the environment variables. Are there any dependencies needed to run the pika libraries? Could someone please help me out?
When I tried to run the script from the command line as "./pikascript.py" it was using the python interpreter at "C:\Python\python.exe", but when I ran the same script as "python pikascript.py" it used another python installation on the same machine, where setuptools and the pika library are not installed properly.
So I started executing the script as "C:\Python\python.exe pikascript.py" and the script runs without any error.
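A quick way to confirm which interpreter each invocation actually uses (a generic diagnostic, not specific to pika) is to print it from inside the script:

import sys

# Shows exactly which interpreter is running and whether pika is importable from it.
print(sys.executable)
try:
    import pika
    print(pika.__file__)
except ImportError:
    print('pika is not installed for this interpreter')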