Airflow KubernetesPodOperator example fails to run - airflow-scheduler

Trying to run the sample KubernetesPodOperator DAG fails with:
[2020-05-25 20:00:40,475] {{__init__.py:51}} INFO - Using executor LocalExecutor
[2020-05-25 20:00:40,475] {{dagbag.py:396}} INFO - Filling up the DagBag from /usr/local/airflow/dags/kubernetes_example.py
Traceback (most recent call last):
  File "/usr/local/bin/airflow", line 37, in <module>
    args.func(args)
  File "/usr/local/lib/python3.7/site-packages/airflow/utils/cli.py", line 75, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/airflow/bin/cli.py", line 523, in run
    dag = get_dag(args)
  File "/usr/local/lib/python3.7/site-packages/airflow/bin/cli.py", line 149, in get_dag
    'parse.'.format(args.dag_id))
airflow.exceptions.AirflowException: dag_id could not be found: kubernetes_example. Either the dag did not exist or it failed to parse.
This is the code I am using:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.dates import days_ago
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago(1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=60)
}
dag = DAG(
    'kubernetes_example', default_args=default_args, schedule_interval=timedelta(minutes=60))
start = DummyOperator(task_id='run_this_first', dag=dag)
passing = KubernetesPodOperator(namespace='airflow',
                                image="python:3.6.10",
                                cmds=["Python","-c"],
                                arguments=["print('hello world')"],
                                labels={"foo": "bar"},
                                name="passing-test",
                                task_id="passing-task",
                                env_vars={'EXAMPLE_VAR': '/example/value'},
                                in_cluster=True,
                                get_logs=True,
                                dag=dag
                                )
failing = KubernetesPodOperator(namespace='airflow',
                                image="ubuntu:18.04",
                                cmds=["Python","-c"],
                                arguments=["print('hello world')"],
                                labels={"foo": "bar"},
                                name="fail",
                                task_id="failing-task",
                                get_logs=True,
                                dag=dag
                                )
passing.set_upstream(start)
failing.set_upstream(start)
I just took it from the sample.
Has anyone else run into this issue?
Thanks!

Your DAG needs a name; pass it explicitly as dag_id:
dag = DAG(
    dag_id='kubernetes_example',
    default_args=default_args,
    schedule_interval=timedelta(minutes=60)
)
Also, your task_id values should use _ rather than -, for example: task_id="failing_task"
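Since the error message says the DAG "did not exist or it failed to parse", it can help to rule out the second case first. The following is a minimal sketch, assuming the DAG file can be imported as an ordinary Python module in the same environment as the scheduler; the helper name try_import and the module label dag_under_test are hypothetical and not part of Airflow.

```python
import importlib.util
import traceback


def try_import(path):
    """Import the Python file at `path`.

    Returns None if the import succeeds, otherwise the traceback as text.
    """
    # "dag_under_test" is an arbitrary module name chosen for this sketch.
    spec = importlib.util.spec_from_file_location("dag_under_test", path)
    module = importlib.util.module_from_spec(spec)
    try:
        spec.loader.exec_module(module)
    except Exception:
        # Syntax errors, missing imports, and so on all end up here.
        return traceback.format_exc()
    return None
```

Calling try_import("/usr/local/airflow/dags/kubernetes_example.py") inside the scheduler container and printing the result would show any import or syntax error that keeps the file from loading.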

How to specify celery beat periodic task?

I want to create a periodic task with Django and Celery. I have configured Celery in my project.
The project structure looks like this:
project
├── apps
│   ├── laws
│   │   └── tasks
│   │       └── periodic.py  # the task is in this file
├── config
│   ├── celery.py
│   ├── settings
│   │   └── base.py  # CELERY_BEAT_SCHEDULE defined in this file
base.py:
CELERY_BEAT_SCHEDULE = {
    "sample_task": {
        "task": "apps.laws.tasks.periodic.SampleTask",  # the problem is in this line
        "schedule": crontab(minute="*/1"),
    },
}
periodic.py:
class SampleTask(Task):
    name = "laws.sample_task"
    def run(self, operation, *args, **kwargs):
        logger.info("The sample task is running...")
Here is the error I get:
The delivery info for this task is:
{'exchange': '', 'routing_key': 'celery'}
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/celery/worker/consumer/consumer.py", line 591, in on_task_received
    strategy = strategies[type_]
KeyError: 'apps.laws.tasks.periodic.SampleTask'
How can I define the route of the task correctly?
I solved it.
Instead of the import path, put the task's registered name (its name attribute) in the settings:
CELERY_BEAT_SCHEDULE = {
    "laws.sample_task": {
        "task": "laws.sample_task",  # put the name here.
        "schedule": crontab(minute="*/1"),
    },
}
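The KeyError makes sense once you picture the worker looking tasks up by their registered name rather than by import path. Below is a toy model of that lookup in plain Python. It is not Celery's code: the Task base class, the registry dict and the lookup function are all stand-ins invented for illustration.

```python
class Task:
    """Stand-in for a task base class; only the `name` attribute matters here."""
    name = None


class SampleTask(Task):
    # Same explicit name as in periodic.py above.
    name = "laws.sample_task"


# Toy registry: tasks are stored under their `name`, not their import path.
registry = {task.name: task for task in (SampleTask,)}


def lookup(task_name):
    """Return the task registered under `task_name` (raises KeyError if absent)."""
    return registry[task_name]
```

In this model, lookup("laws.sample_task") finds the class, while lookup("apps.laws.tasks.periodic.SampleTask") raises KeyError, which mirrors the traceback in the question.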

Django: How to join two models and get the fields of both as a result?

I have two (simplified) models:
class Message(models.Model):
    mmsi = models.IntegerField()
    time = models.DateTimeField()
class LastMessage(models.Model):
    mmsi = models.IntegerField()
    time = models.DateTimeField()
Starting from the most recent Message for each mmsi, I want to find those whose time differs from the one stored in LastMessage.
I can get the most recent Message per mmsi like this:
recent_messages = (Message.objects.distinct('mmsi')
                   .order_by('mmsi', '-time'))
From there, I have no idea how to combine recent_messages and LastMessage to extract the information I want.
I'd typically like to have the information displayed like this :
MMSI | Message_Time | LastMessage_Time
I can do that with SQL like this for example :
SELECT r.mmsi, r.time as Message_Time, l.time as LastMessage_Time
FROM recent_message as r
INNER JOIN core_lastmessage as l on r.mmsi = l.mmsi
WHERE r.time <> l.time LIMIT 10;
┌─────────┬────────────────────────┬────────────────────────┐
│  mmsi   │      message_time      │    lastmessage_time    │
├─────────┼────────────────────────┼────────────────────────┤
│ 2000000 │ 2019-09-10 10:42:03+02 │ 2019-09-10 10:26:26+02 │
│ 2278000 │ 2019-09-10 10:42:24+02 │ 2019-09-10 10:40:18+02 │
│ 2339002 │ 2019-09-10 10:42:06+02 │ 2019-09-10 10:33:02+02 │
│ 2339004 │ 2019-09-10 10:42:06+02 │ 2019-09-10 10:30:07+02 │
│ 2417806 │ 2019-09-10 10:39:19+02 │ 2019-09-10 10:37:02+02 │
│ 2417807 │ 2019-09-10 10:41:18+02 │ 2019-09-10 10:36:55+02 │
│ 2417808 │ 2019-09-10 10:42:23+02 │ 2019-09-10 10:30:39+02 │
│ 2470087 │ 2019-09-10 10:42:23+02 │ 2019-09-10 10:39:13+02 │
│ 3160184 │ 2019-09-10 10:42:03+02 │ 2019-09-10 10:28:30+02 │
│ 3604482 │ 2019-09-10 10:42:10+02 │ 2019-09-10 10:35:29+02 │
└─────────┴────────────────────────┴────────────────────────┘
(Here recent_message is just a temporary table for convenience.)
How could I do that in Django?
Thanks!
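One possible approach, sketched here under assumptions, is to fetch both sides as (mmsi, time) pairs and do the comparison in Python instead of in SQL. The function name changed_messages is hypothetical, and so is the idea of loading the pairs with values_list('mmsi', 'time') on each queryset. The code itself only needs plain tuples, so it holds every LastMessage row in memory and suits modest table sizes.

```python
def changed_messages(recent, last):
    """Return (mmsi, message_time, lastmessage_time) rows where the times differ.

    `recent` and `last` are iterables of (mmsi, time) pairs, with at most
    one pair per mmsi on each side.
    """
    last_by_mmsi = dict(last)
    rows = []
    for mmsi, message_time in recent:
        last_time = last_by_mmsi.get(mmsi)
        # Inner-join semantics: skip mmsi values that have no LastMessage row.
        if last_time is not None and last_time != message_time:
            rows.append((mmsi, message_time, last_time))
    return rows
```

With Django, the two arguments could come from recent_messages.values_list('mmsi', 'time') and LastMessage.objects.values_list('mmsi', 'time'), which matches the inner join with r.time <> l.time in the SQL above.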

Scrapy integration with DjangoItem yields error

I am trying to run Scrapy with DjangoItem. When I run my spider, I get the error 'ExampleDotComItem does not support field: title'. I have created multiple projects and tried to get it to work, but I always get the same error. I found this tutorial and downloaded its source code; after running it, I get the same error:
Traceback (most recent call last):
  File "c:\programdata\anaconda3\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Users\A\Desktop\django1.7-scrapy1.0.3-master\example_bot\example_bot\spiders\example.py", line 12, in parse
    return ExampleDotComItem(title=title, description=description)
  File "c:\programdata\anaconda3\lib\site-packages\scrapy_djangoitem\__init__.py", line 29, in __init__
    super(DjangoItem, self).__init__(*args, **kwargs)
  File "c:\programdata\anaconda3\lib\site-packages\scrapy\item.py", line 56, in __init__
    self[k] = v
  File "c:\programdata\anaconda3\lib\site-packages\scrapy\item.py", line 66, in __setitem__
    (self.__class__.__name__, key))
KeyError: 'ExampleDotComItem does not support field: title'
Project structure:
django1.7-scrapy1.0.3-master
├───example_bot
│   └───example_bot
│       ├───spiders
│       │   └───__pycache__
│       └───__pycache__
└───example_project
    ├───app
    │   ├───migrations
    │   │   └───__pycache__
    │   └───__pycache__
    └───example_project
        └───__pycache__
My Django Model:
from django.db import models
class ExampleDotCom(models.Model):
    title = models.CharField(max_length=255)
    description = models.CharField(max_length=255)
    def __str__(self):
        return self.title
My "example" Spider:
from scrapy.spiders import BaseSpider
from example_bot.items import ExampleDotComItem
class ExampleSpider(BaseSpider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ['http://www.example.com/']
    def parse(self, response):
        title = response.xpath('//title/text()').extract()[0]
        description = response.xpath('//body/div/p/text()').extract()[0]
        return ExampleDotComItem(title=title, description=description)
items.py:
from scrapy_djangoitem import DjangoItem
from app.models import ExampleDotCom
class ExampleDotComItem(DjangoItem):
    django_model = ExampleDotCom
pipelines.py:
class ExPipeline(object):
    def process_item(self, item, spider):
        print(item)
        item.save()
        return item
settings.py:
import os
import sys
DJANGO_PROJECT_PATH = '/Users/A/DESKTOP/django1.7-scrapy1.0.3-master/example_project'
DJANGO_SETTINGS_MODULE = 'example_project.settings'  # Assuming your Django application's name is example_project
sys.path.insert(0, DJANGO_PROJECT_PATH)
os.environ['DJANGO_SETTINGS_MODULE'] = DJANGO_SETTINGS_MODULE
BOT_NAME = 'example_bot'
import django
django.setup()
SPIDER_MODULES = ['example_bot.spiders']
ITEM_PIPELINES = {
    'example_bot.pipelines.ExPipeline': 1000,
}
Can you show your Django model? This is likely occurring because title isn't defined on your ExampleDotCom model.
If it is there, perhaps you need to run your Django migrations?

Airflow Exception: Dataflow failed with return code 2

I am trying to execute a dataflow python file that reads a text file from a GCS bucket through an airflow DAG using its DataFlowPythonOperator. I have been able to execute the python file independently but it fails when I execute it through airflow. I am using a service account to authenticate for my default gcp connection.
The error I get when executing the job is:
{gcp_dataflow_hook.py:108} INFO - Start waiting for DataFlow process to complete.
{models.py:1417} ERROR - DataFlow failed with return code 2
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 1374, in run
    result = task_copy.execute(context=context)
  File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/operators/dataflow_operator.py", line 182, in execute
    self.py_file, self.py_options)
  File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 152, in start_python_dataflow
    task_id, variables, dataflow, name, ["python"] + py_options)
  File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 138, in _start_dataflow
    _Dataflow(cmd).wait_for_done()
  File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 119, in wait_for_done
    self._proc.returncode))
Exception: DataFlow failed with return code 2
My airflow script:
from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator
from datetime import datetime, timedelta
# Default DAG parameters
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': <email>,
    'email_on_failure': False,
    'email_on_retry': False,
    'start_date': datetime(2018, 4, 30),
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
    'dataflow_default_options': {
        'project': '<Project ID>'
    }
}
dag = DAG(
    dag_id='df_dag_readfromgcs',
    default_args=default_args,
    schedule_interval=timedelta(minutes=60)
)
task1 = DataFlowPythonOperator(
    task_id='task1',
    py_file='~<path>/1readfromgcs.py',
    gcp_conn_id='default_google_cloud_connection',
    dag=dag
)
My Dataflow python file (1readfromgcs.py) contains the following code:
from __future__ import absolute_import
import argparse
import logging
import apache_beam as beam
import apache_beam.pipeline as pipeline
import apache_beam.io as beamio
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io import ReadFromText
def runCode(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--input',
                        default='<Input file path>',
                        help='File name')
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_args.extend([
        '--project=<project name>',
        '--runner=DataflowRunner',
        '--job_name=<job name>',
        '--region=europe-west1',
        '--staging_location=<GCS staging location>',
        '--temp_location=<GCS temp location>'
    ])
    pipeline_options = PipelineOptions(pipeline_args)
    p = beam.pipeline.Pipeline(options=pipeline_options)
    rows = p | 'read' >> beam.io.ReadFromText(known_args.input)
    p.run().wait_until_finish()
if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    runCode()
I have been unable to debug this and find the reason for the exception. From my investigation of Airflow's gcp_dataflow_hook.py (https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/gcp_dataflow_hook.py), the error is raised from the following lines:
def wait_for_done(self):
    reads = [self._proc.stderr.fileno(), self._proc.stdout.fileno()]
    self.log.info("Start waiting for DataFlow process to complete.")
    while self._proc.poll() is None:
        ret = select.select(reads, [], [], 5)
        if ret is not None:
            for fd in ret[0]:
                line = self._line(fd)
                self.log.debug(line[:-1])
        else:
            self.log.info("Waiting for DataFlow process to complete.")
    if self._proc.returncode is not 0:
        raise Exception("DataFlow failed with return code {}".format(
            self._proc.returncode))
I would appreciate your thoughts and help with this issue.
This exception comes from _proc, which is a subprocess; the value reported is that process's exit code.
I haven't worked with this component yet. What exit code 2 means depends on the program being executed. In bash, for example, it means:
Misuse of shell builtins
and can be connected to:
Missing keyword or command, or permission problem
So it might be related to the underlying Dataflow configuration. Try executing the file manually as the user that Airflow runs as.
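Because the hook shown in the question logs the child process's output only at debug level, the underlying message may not be visible in the task log. One case that is easy to check, though it is only an assumption that it applies here, is that the Python interpreter itself exits with status 2 when it cannot open the script it is given. A path beginning with ~ is not expanded when a command runs without a shell. The sketch below reproduces this with a deliberately nonexistent, hypothetical path; run_script is an illustrative helper, not part of Airflow.

```python
import subprocess
import sys


def run_script(path):
    """Run `path` with the current interpreter; return (exit code, stderr text)."""
    proc = subprocess.run(
        [sys.executable, path],  # no shell involved, so "~" is not expanded
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return proc.returncode, proc.stderr


# Hypothetical path used only to trigger the failure.
code, err = run_script("~/no_such_dir/1readfromgcs.py")
print(code)          # exit status reported by the interpreter
print(err.strip())   # the message that would otherwise stay hidden
```

If the same kind of message appears when the job's command is run by hand as the Airflow user, passing an absolute path in py_file would be the thing to try.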

Adding a third-party library (twilio) to project using Google App Engine and Django

Hi everyone.
I'm new to this. I am developing a web application on Google App Engine using the Django framework, and I have a problem with the Python lib directory: ImportError: No module named ...
My appengine_config.py file is:
# [START vendor]
from google.appengine.ext import vendor
vendor.add('lib')  # I believe this line adds the 'lib' folder to the path.
# vendor.add(os.path.join(os.path.dirname(os.path.realpath(__file__)), 'lib'))  # <-- I tried this line too.
# [END vendor]
My requirements.txt file is:
MySQL-python==1.2.5  # App Engine Django project default
Django==1.11.3  # App Engine Django project default
django-twilio  # the library I want to add
twilio  # the library I want to add
and I installed them using pip install -t lib -r requirements.txt
ROOT
├── lib
│ ├── django
│ ├── pytz
│ ├── wanttousing_lib
│ └── ...
├── mysite
│ ├── __init__.py
│ ├── settings.py
│ ├── controllers.py
│ ├── models.py
│ ├── views.py
│ ├── templates
│ └── ....
├── test
│ ├── like
│ │ ├── models_tests.py
│ │ └── controllers_tests.py
│ └── ....
├── static
│ ├── css
│ └── js
├── app.yaml
├── manage.py
├── appengine_config.py
├── requirement-vendor.txt
└── requirements.txt
So the libraries are installed in my project, but the import fails:
from wanttousing_lib import example_module
ImportError: wanttousing_lib ...
However, if I move wanttousing_lib to the ROOT dir, it works:
ROOT
├── lib
│ ├── django
│ ├── pytz
│
│ └── ...
├── mysite
│ ├── __init__.py
│ ├── settings.py
│ ├── controllers.py
│ ├── models.py
│ ├── views.py
│ ├── templates
│ │ └── like
│ │ ├── index.html
│ │ └── _likehelpers.html
│ └── ....
├── test
│ ├── like
│ │ ├── models_tests.py
│ │ └── controllers_tests.py
│ └── ....
├── static
│ ├── css
│ └── js
├── app.yaml
├── manage.py
├── appengine_config.py
├── requirement-vendor.txt
├── requirements.txt
└── wanttousing_lib <--- moved
Full traceback:
Unhandled exception in thread started by <function wrapper at 0x103e0eaa0>
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/django/utils/autoreload.py", line 227, in wrapper
    fn(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/django/core/management/commands/runserver.py", line 125, in inner_run
    self.check(display_num_errors=True)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/django/core/management/base.py", line 359, in check
    include_deployment_checks=include_deployment_checks,
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/django/core/management/base.py", line 346, in _run_checks
    return checks.run_checks(**kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/django/core/checks/registry.py", line 81, in run_checks
    new_errors = check(app_configs=app_configs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/django/core/checks/urls.py", line 16, in check_url_config
    return check_resolver(resolver)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/django/core/checks/urls.py", line 26, in check_resolver
    return check_method()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/django/urls/resolvers.py", line 254, in check
    for pattern in self.url_patterns:
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/django/utils/functional.py", line 35, in __get__
    res = instance.__dict__[self.name] = self.func(instance)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/django/urls/resolvers.py", line 405, in url_patterns
    patterns = getattr(self.urlconf_module, "urlpatterns", self.urlconf_module)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/django/utils/functional.py", line 35, in __get__
    res = instance.__dict__[self.name] = self.func(instance)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/django/urls/resolvers.py", line 398, in urlconf_module
    return import_module(self.urlconf_name)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "ROOT/mysite/urls.py", line 19, in <module>
    from polls.views import index
  File "ROOT/polls/views.py", line 17, in <module>
    from sms_twilio.tests import send_sms_test
  File "ROOT/sms_twilio/tests.py", line 13, in <module>
    from twilio import twiml
ImportError: No module named twilio
Source of the error (sms_twilio/tests.py):
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
from django.test import TestCase
# Create your tests here.
from django.conf import settings
# file: your_code.py
# import twilio # no need for 'from lib import twilio'
# do stuff with twilio...
from twilio import twiml
from twilio.rest import Client
def send_twilio_message(to_number, body):
    client = Client(
        #client = twilio.rest.TwilioRestClient(
        settings.TWILIO_ACCOUNT_SID, settings.TWILIO_AUTH_TOKEN)
    return client.messages.create(
        body=body,
        to=to_number,
        from_=settings.TWILIO_PHONE_NUMBER
    )
def send_sms_test():
    client = Client(
        #client = twilio.rest.TwilioRestClient(
        settings.TWILIO_ACCOUNT_SID, settings.TWILIO_AUTH_TOKEN)
    return client.messages.create(
        body="[TEST] SEND SMS !! HELLO !!",
        to="TO_SENDER",
        from_=settings.TWILIO_PHONE_NUMBER
    )
Or do I need to add the library to the libraries list in app.yaml, like this?
libraries:
- name: MySQLdb
  version: 1.2.5
- name: twilio  <-- like this
  version: -
My requirement-vendor.txt file is:
Django==1.11.3
How can I fix this? Please help.
I had a similar issue a while back. Instead of using vendor.add('lib'), I had success with this (it needs import os at the top of appengine_config.py):
vendor.add(os.path.join(os.path.dirname(os.path.realpath(__file__)), 'lib'))
I found that solution in the docs for using third-party libraries on GAE, in step 4:
To copy a library into your project:
1. Create a directory to store your third-party libraries, such as lib/.
   mkdir lib
2. Use pip (version 6 or later) with the -t flag to copy the libraries into the folder you created in the previous step. For example:
   pip install -t lib/ <library_name>
3. Create a file named appengine_config.py in the same folder as your app.yaml file.
4. Edit the appengine_config.py file and provide your library directory to the vendor.add() method.
   # appengine_config.py
   from google.appengine.ext import vendor
   # Add any libraries installed in the "lib" folder.
   vendor.add('lib')
   The appengine_config.py file above assumes that the current working directory is where the lib folder is located. In some cases, such as unit tests, the current working directory can be different. To avoid errors, you can explicitly pass in the full path to the lib folder using:
   vendor.add(os.path.join(os.path.dirname(os.path.realpath(__file__)), 'lib'))
It turns out I have two Python lib directories:
1) /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/...
2) /usr/local/lib/python2.7/...
My project uses 1), but pip installs into 2).
I ran pip from 1) (./pip install twilio), and now it works!
Thanks.
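A mismatch like this, where pip installs into a different Python than the one running the project, can be confirmed from inside the interpreter. The sketch below is written for Python 3, whereas the question uses Python 2.7, where importlib.util.find_spec is not available. The helper name where_is is hypothetical.

```python
import importlib.util
import sys


def where_is(module_name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(module_name)
    return spec.origin if spec is not None else None


print(sys.executable)      # the interpreter actually running this code
print(where_is("twilio"))  # None means this interpreter cannot see twilio
```

Running pip as python -m pip install twilio, with the same python that runs the project, installs into that interpreter's own site-packages and avoids the mismatch.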