Django ORM in Airflow - is it possible?

How to work with Django models inside Airflow tasks?
According to the official Airflow documentation, Airflow provides hooks for interaction with databases (like MySqlHook / PostgresHook / etc.) that can later be used in Operators for raw query execution. Attaching the core code fragments:
Copied from https://airflow.apache.org/_modules/mysql_hook.html:

class MySqlHook(DbApiHook):
    conn_name_attr = 'mysql_conn_id'
    default_conn_name = 'mysql_default'
    supports_autocommit = True

    def get_conn(self):
        """
        Returns a mysql connection object
        """
        conn = self.get_connection(self.mysql_conn_id)
        conn_config = {
            "user": conn.login,
            "passwd": conn.password or ''
        }
        conn_config["host"] = conn.host or 'localhost'
        conn_config["db"] = conn.schema or ''
        conn = MySQLdb.connect(**conn_config)
        return conn
Copied from https://airflow.apache.org/_modules/mysql_operator.html:

class MySqlOperator(BaseOperator):

    @apply_defaults
    def __init__(
            self, sql, mysql_conn_id='mysql_default', parameters=None,
            autocommit=False, *args, **kwargs):
        super(MySqlOperator, self).__init__(*args, **kwargs)
        self.mysql_conn_id = mysql_conn_id
        self.sql = sql
        self.autocommit = autocommit
        self.parameters = parameters

    def execute(self, context):
        logging.info('Executing: ' + str(self.sql))
        hook = MySqlHook(mysql_conn_id=self.mysql_conn_id)
        hook.run(
            self.sql,
            autocommit=self.autocommit,
            parameters=self.parameters)
As we can see, the Hook encapsulates the connection configuration while the Operator provides the ability to execute custom queries.
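For context, a hook can also be used directly from Python code, e.g. inside a PythonOperator callable; here is a minimal sketch, where the connection id, table, and import path (valid for older Airflow versions) are placeholders, not part of the official docs:

# import path used by Airflow versions contemporary with the docs above
from airflow.hooks.mysql_hook import MySqlHook

def fetch_rows(**context):
    hook = MySqlHook(mysql_conn_id='mysql_default')
    # get_records comes from DbApiHook and returns a list of row tuples
    return hook.get_records('SELECT id, name FROM my_table LIMIT 10')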
The problem:
It's very convenient to use an ORM for fetching and processing database objects instead of raw SQL, for the following reasons:
In straightforward cases, an ORM can be a much more convenient solution (see ORM definitions).
Assume there is an already established system like Django with defined models and their methods. Every time these models' schemas change, the Airflow raw SQL queries need to be rewritten. An ORM provides a unified interface for working with such models.
For some reason, there are no examples of working with an ORM in Airflow tasks in terms of hooks and operators. According to the Using Django database layer outside of Django? question, you need to set up a connection configuration to the database and then execute queries through the ORM directly, but doing that outside the appropriate hooks / operators breaks Airflow principles. It's like calling BashOperator with a "python work_with_django_models.py" command.
Finally, we want this:
So what are the best practices in this case? Are there any shared hooks / operators for the Django ORM or other ORMs? The goal is to make the following code real (treat it as pseudo-code!):
import os
import django

os.environ.setdefault(
    "DJANGO_SETTINGS_MODULE",
    "myapp.settings"
)
django.setup()

from your_app import models

def get_and_modify_models(ds, **kwargs):
    all_objects = models.MyModel.objects.filter(my_str_field='abc')
    all_objects[15].my_int_field = 25
    all_objects[15].save()
    return list(all_objects)

django_op = DjangoOperator(task_id='get_and_modify_models', owner='airflow')
instead of implementing this functionality in raw SQL.
I think this is a pretty important topic, as a whole bunch of ORM-based frameworks and processes cannot be brought into Airflow in this situation.
Thanks in advance!

I agree we should continue to have this discussion, as having access to the Django ORM can significantly reduce the complexity of solutions.
My approach has been to 1) create a DjangoOperator:
import os, sys

from airflow.models import BaseOperator

def setup_django_for_airflow():
    # Add Django project root to path
    sys.path.append('./project_root/')

    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myapp.settings")

    import django
    django.setup()

class DjangoOperator(BaseOperator):

    def pre_execute(self, *args, **kwargs):
        setup_django_for_airflow()
and 2) extend that DjangoOperator for logic / operators that would benefit from having access to the ORM:
from .base import DjangoOperator

class DjangoExampleOperator(DjangoOperator):

    def execute(self, context):
        from myApp.models import model
        model.objects.get_or_create()
With this strategy you can then distinguish between operators that use raw SQL and operators that use the ORM. Also note that for the Django operator, all Django model imports need to happen within the execution context, as demonstrated above.
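For completeness, here is a minimal sketch of wiring such an operator into a DAG; the dag_id, dates, and import path are illustrative assumptions rather than part of the original answer:

from datetime import datetime

from airflow import DAG
from myapp.operators import DjangoExampleOperator  # assumed module path

with DAG(dag_id='django_orm_example',
         start_date=datetime(2020, 1, 1),
         schedule_interval=None) as dag:
    # Django is set up in pre_execute, so the task body can use the ORM
    get_or_create = DjangoExampleOperator(task_id='get_or_create_model')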

Related

Django disable low level caching with context manager

One of my methods in a project I'm working on looks like this:
from django.core.cache import cache

from app import models

def _get_active_children(parent_id, timestamp):
    # Note: the get and set keys must match for the cache to ever hit
    children = cache.get(f"active_children_{parent_id}")
    if children is None:
        children = models.Children.objects.filter(parent_id=parent_id).active(
            dt=timestamp
        )
        cache.set(
            f"active_children_{parent_id}",
            children,
            60 * 10,
        )
    return children
The issue is that I don't want caching to occur when this method is called via the command line (it's inside a task). So I'm wondering if there's a way to disable caching of this form?
Ideally I want to use a context manager so that any cache calls inside the context are ignored (or pushed to a DummyCache / local-memory cache, which wouldn't affect my main Redis cache).
I've considered passing skip_cache=True through the methods, but this is pretty brittle and I'm sure there's a more elegant solution. Additionally, I've tried using mock.patch, but I'm not sure that works outside of test classes.
My ideal solution would look something like:
def task():
    ...
    _get_active_children(parent_id, timestamp)

with no_cache:
    task()
I have a solution (but I think there's a better one out there):
from unittest.mock import patch

from django.core.cache.backends.dummy import DummyCache
from django.utils.module_loading import import_string

def no_cache(module_str, cache_object_str='cache'):
    """ example usage: with no_cache('app.tasks', 'cache'): """
    module_ = import_string(module_str)
    return patch.object(module_, cache_object_str, DummyCache('mock', {}))
Inspired by this.
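A hypothetical usage sketch, assuming the helper lives in app/utils.py and that app.tasks imports cache at module level (both are assumptions):

from app.utils import no_cache  # assumed location of the helper above
from app import tasks

# While the patch is active, cache.get / cache.set inside app.tasks hit the
# DummyCache, so _get_active_children always falls through to the database.
with no_cache('app.tasks'):
    tasks._get_active_children(parent_id=42, timestamp=None)  # placeholder args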

Avoiding circular imports in Django Models (Config class)

I've created a Configuration model in Django so that the site admin can change some settings on the fly; however, some of the models rely on these configurations. I'm using Django 2.0.2 and Python 3.6.4.
I created a config.py file in the same directory as models.py.
Let me paracode it (paraphrase the code; the real Enum has many more options):
# models.py
from django.db import models

from .config import *

class Configuration(models.Model):
    starting_money = models.IntegerField(default=1000)

class Person(models.Model):
    funds = models.IntegerField(default=getConfig(ConfigData.STARTING_MONEY))

# config.py
from enum import Enum

from .models import Configuration

class ConfigData(Enum):
    STARTING_MONEY = 1

def getConfig(data):
    if not isinstance(data, ConfigData):
        raise TypeError(f"{data} is not a valid configuration type")
    try:
        config, _ = Configuration.objects.get_or_create()
    except Configuration.MultipleObjectsReturned:
        # Cleans database in case multiple configurations exist.
        Configuration.objects.exclude(pk=Configuration.objects.first().pk).delete()
        return getConfig(data)
    if data is ConfigData.MAXIMUM_STAKE:
        return config.max_stake
How can I do this without an import error? I've tried absolute imports, without success.
You can postpone loading models.py by importing it inside the getConfig(data) function; as a result, we no longer need models.py at the time we load config.py:
# config.py (no models import at the top)
from enum import Enum

class ConfigData(Enum):
    STARTING_MONEY = 1

def getConfig(data):
    from .models import Configuration
    if not isinstance(data, ConfigData):
        raise TypeError(f"{data} is not a valid configuration type")
    try:
        config, _ = Configuration.objects.get_or_create()
    except Configuration.MultipleObjectsReturned:
        # Cleans database in case multiple configurations exist.
        Configuration.objects.exclude(pk=Configuration.objects.first().pk).delete()
        return getConfig(data)
    if data is ConfigData.MAXIMUM_STAKE:
        return config.max_stake
We thus do not load models.py in config.py. We only check whether it is loaded (and load it if not) when we actually execute the getConfig function, which happens later in the process.
Willem Van Onsem's solution is a good one. I have a different approach, which I have used for circular model dependencies, using Django's application registry. I post it here as an alternative solution, partly because I'd like feedback from more experienced Python coders as to whether or not there are problems with this approach.
In a utility module, define the following method:
from django.apps import apps as django_apps

def model_by_name(app_name, model_name):
    return django_apps.get_app_config(app_name).get_model(model_name)
Then in your getConfig, omit the import and replace the line

config, _ = Configuration.objects.get_or_create()

with the following:

config_class = model_by_name(APP_NAME, 'Configuration')
config, _ = config_class.objects.get_or_create()
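As a side note, Django's registry also offers a one-step lookup that avoids the helper entirely; a minimal sketch, assuming the app label is 'myapp':

from django.apps import apps

# Equivalent to model_by_name('myapp', 'Configuration') above
Configuration = apps.get_model('myapp', 'Configuration')
config, _ = Configuration.objects.get_or_create()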

Django testing of neo4j database

I'm using Django with neo4j as the database and neomodel as the OGM. How do I test it?
When I run python3 manage.py test, all the changes my tests make are left behind.
Also, how do I set up two databases, one for testing and another for production, and specify which one to use and when?
I assume the reason all of your changes are being retained is that you are using the same neo4j database for testing as for development. Since neomodel isn't tightly integrated with Django, it doesn't act the same way Django's ORM does when testing. Django does some helpful things when you run tests using its ORM, such as creating a test database that is destroyed upon completion.
With neo4j and neomodel I'd recommend doing the following:
Create a Custom Test Runner
Django enables you to define a custom test runner by setting the TEST_RUNNER settings variable. An extremely simple version to get you going would be:
from time import sleep
from subprocess import call

from django.test.runner import DiscoverRunner

class MyTestRunner(DiscoverRunner):

    def setup_databases(self, *args, **kwargs):
        # Stop your development instance
        call("sudo service neo4j-service stop", shell=True)
        # Sleep to ensure the service has completely stopped
        sleep(1)
        # Start your test instance (see section below for more details)
        success = call("/path/to/test/db/neo4j-community-2.2.2/bin/neo4j"
                       " start-no-wait", shell=True)
        # Need to sleep to wait for the test instance to completely come up
        sleep(10)
        if success != 0:
            return False
        try:
            # For neo4j 2.2.x you'll need to set a password or deactivate
            # auth; Nigel Small's py2neo gives us an easy way to do this
            # (shell=True is required for the `source ... &&` chain)
            call("source /path/to/virtualenv/bin/activate && "
                 "/path/to/virtualenv/bin/neoauth "
                 "neo4j neo4j my-p4ssword", shell=True)
        except OSError:
            pass
        # Don't import neomodel until we get here because we need to wait
        # for the new db to be spawned
        from neomodel import db
        # Delete all previous entries in the db prior to running tests
        query = "match (n)-[r]-() delete n,r"
        db.cypher_query(query)
        # Hand off to the default implementation for any relational databases
        return super(MyTestRunner, self).setup_databases(*args, **kwargs)

    def teardown_databases(self, old_config, **kwargs):
        from neomodel import db
        # Delete all previous entries in the db after running tests
        query = "match (n)-[r]-() delete n,r"
        db.cypher_query(query)
        sleep(1)
        # Shut down test neo4j instance
        success = call("/path/to/test/db/neo4j-community-2.2.2/bin/neo4j"
                       " stop", shell=True)
        if success != 0:
            return False
        sleep(1)
        # start back up development instance
        call("sudo service neo4j-service start", shell=True)
Add a secondary neo4j database
This can be done in a couple of ways, but to follow along with the test runner above you can download a community distribution from neo4j's website. With this secondary instance you can now swap between databases using the command-line statements in the calls within the test runner.
Wrap Up
This solution assumes you're on a Linux box, but it should be portable to a different OS with minor modifications. I'd also recommend checking out Django's test runner docs to expand upon what the test runner can do.
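To activate the runner, point Django at it in your settings; the module path below is an assumption about where you saved the class:

# settings.py
TEST_RUNNER = 'myproject.test_runner.MyTestRunner'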
There currently isn't a mechanism for working with test databases in neomodel, as neo4j only has one schema per instance.
However, you can override the NEO4J_REST_URL environment variable when running the tests, like so:

NEO4J_REST_URL=http://localhost:7473/db/data python3 manage.py test
The way I went about this was to give in and use the existing database, but mark all test-related nodes and detach/delete them when finished. It's obviously not ideal: all your node classes must inherit from NodeBase or risk polluting the db with test data, and if you have unique constraints, those will still be enforced across both live and test data. But it works for my purposes, and I thought I'd share it in case it helps someone else.
in myproject/base.py:
from django.conf import settings
from neomodel import StructuredNode
from neomodel.properties import Property, validator

class TestModeProperty(Property):
    """
    Boolean property that is only set during unit testing.
    """
    @validator
    def inflate(self, value):
        return bool(value)

    @validator
    def deflate(self, value):
        return bool(value)

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.default = True
        self.has_default = settings.UNIT_TESTING

class NodeBase(StructuredNode):
    __abstract_node__ = True
    test_mode = TestModeProperty()
in myproject/test_runner.py:
from django.test.runner import DiscoverRunner
from neomodel import db

class NeoDiscoverRunner(DiscoverRunner):

    def teardown_databases(self, old_config, **kwargs):
        db.cypher_query(
            """
            MATCH (node {test_mode: true})
            DETACH DELETE node
            """
        )
        return super().teardown_databases(old_config, **kwargs)
in settings.py:
import sys

UNIT_TESTING = sys.argv[1:2] == ["test"]
TEST_RUNNER = "myproject.test_runner.NeoDiscoverRunner"

Django test script to pre-populate DB

I'm trying to pre-populate the database with some test data for my Django project. Is there some easy way to do this with a script that's "outside" of Django?
Let's say I want to do this very simple task, creating a handful of test users with the following code,
N = 10
i = 0
while i < N:
    c = 'user' + str(i) + '@gmail.com'
    u = lancer.models.CustomUser.objects.create_user(email=c, password="12345")
    i = i + 1
The questions are,
WHERE do I put this test script file?
WHAT IMPORTS / COMMANDS do I need to put at the beginning of the file so it has access to all the Django environment & resources as if I were writing this inside the app?
I'm thinking you'd have to import and set up the settings file, import the app's models, etc., but all my attempts have failed one way or another, so I would appreciate some help =)
Thanks!
Providing another answer:
The responses below are excellent answers. I fiddled around and found an alternative way. I added the following to the top of the test data script,
from django.core.management import setup_environ
from project_lancer import settings

setup_environ(settings)

import lancer.models
Now my code above works.
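Note that setup_environ was removed in Django 1.6; on modern Django the equivalent standalone-script bootstrap (reusing the settings module above) looks like this:

import os

import django

# Point Django at the project settings before importing any models
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'project_lancer.settings')
django.setup()

import lancer.models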
I recommend you use fixtures for these purposes:
https://docs.djangoproject.com/en/dev/howto/initial-data/
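For example, a typical fixture workflow looks like this (the app and model names follow the question; the fixture path is an assumption):

# dump previously created users into a fixture file
python manage.py dumpdata lancer.CustomUser --indent 2 > lancer/fixtures/users_for_test.json
# reload them whenever you need the test data
python manage.py loaddata users_for_test.json

The resulting users_for_test.json is the file referenced from the fixtures attribute shown further below.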
If you still want to use this initial code, then read on.
If you use South, you can create a migration and put this code there:

python manage.py schemamigration --empty my_data_migration
class Migration(SchemaMigration):
    no_dry_run = False

    def forwards(self, orm):
        # more pythonic, you can also use bulk_insert here
        for i in xrange(10):
            email = "user{}@gmail.com".format(i)
            u = orm.CustomUser.objects.create_user(email=email, password='12345')
You can also put it in the setUp method of your TestCase:
class MyTestCase(TestCase):

    def setUp(self):
        # more pythonic, you can also use bulk_insert here
        for i in xrange(10):
            email = "user{}@gmail.com".format(i)
            u = lancer.models.CustomUser.objects.create_user(email=email,
                                                             password='12345')

    def test_foo(self):
        pass
You can also define a BaseTestCase that overrides the setUp method, and then create TestCase classes that inherit from it:
class BaseTestCase(TestCase):

    def setUp(self):
        'your initial logic here'

class MyFirstTestCase(BaseTestCase):
    pass

class MySecondTestCase(BaseTestCase):
    pass
But I think fixtures are the best way:

class BaseTestCase(TestCase):
    fixtures = ['users_for_test.json']

class MyFirstTestCase(BaseTestCase):
    pass

class MySecondTestCase(BaseTestCase):
    fixtures = ['special_users_for_only_this_test_case.json']
Updated:
$ python manage.py shell
>>> from django.contrib.auth.hashers import make_password
>>> make_password('12312312')
'pbkdf2_sha256$10000$9KQ15rVsxZ0t$xMEKUicxtRjfxHobZ7I9Lh56B6Pkw7K8cO0ow2qCKdc='
You can also use something like this or this to auto-populate your models for testing purposes.

Django with Pluggable MongoDB Storage troubles

I'm trying to use Django and mongoengine to provide only the storage backend (GridFS). I still have a MySQL database.
I'm running into a strange (to me) error when I delete from the Django admin and am wondering if I am doing something incorrectly.
my code looks like this:
# settings.py
from mongoengine import connect
connect("mongo_storage")

# models.py
from django.db import models

from mongoengine.django.storage import GridFSStorage

class MyFile(models.Model):
    name = models.CharField(max_length=50)
    content = models.FileField(upload_to="appsfiles", storage=GridFSStorage())
    creation_time = models.DateTimeField(auto_now_add=True)
    last_update_time = models.DateTimeField(auto_now=True)
I am able to upload files just fine, but when I delete them something seems to break, and the mongo database gets into an unworkable state until I manually delete all FileDocument.objects. When this happens I can't upload files or delete them from the Django interface.
From the stack trace I have:
/home/projects/vector/src/mongoengine/django/storage.py in _get_doc_with_name
    doc = [d for d in docs if getattr(d, self.field).name == name] ...

Local vars (from the Django debug page):
    _[1] = []
    d    = ...
    docs = Error in formatting: cannot set options after executing query
    name = u'testfile.pdf'
    self = ...

/home/projects/vector/src/mongoengine/fields.py in __getattr__
    raise AttributeError
Am I using this feature incorrectly?
UPDATE:
thanks to @zeekay's answer I was able to get a working GridFS storage plugin. I ended up not using mongoengine at all. I put my adapted solution on GitHub; there is a clear sample project showing how to use it. I also uploaded the project to PyPI.
Another Update:
I'd highly recommend the django-storages project. It has lots of storage backend options and is used by many more people than my originally proposed solution.
I think you are better off not using MongoEngine for this; I haven't had much luck with it either. Here is a drop-in replacement for mongoengine.django.storage.GridFSStorage which works with the admin.
from django.core.files.storage import Storage
from django.conf import settings

from pymongo import Connection
from gridfs import GridFS

class GridFSStorage(Storage):

    def __init__(self, host=None, port=None, collection=None):
        # Defaults, then GRIDFS_* settings, then explicit keyword arguments
        self.host, self.port, self.collection = 'localhost', 27017, 'fs'
        for s in ('host', 'port', 'collection'):
            name = 'GRIDFS_' + s.upper()
            if hasattr(settings, name):
                setattr(self, s, getattr(settings, name))
        for s, v in zip(('host', 'port', 'collection'), (host, port, collection)):
            if v:
                setattr(self, s, v)
        self.db = Connection(host=self.host, port=self.port)[self.collection]
        self.fs = GridFS(self.db)

    def _save(self, name, content):
        self.fs.put(content, filename=name)
        return name

    def _open(self, name, *args, **kwargs):
        return self.fs.get_last_version(filename=name)

    def delete(self, name):
        oid = self.fs.get_last_version(filename=name)._id
        self.fs.delete(oid)

    def exists(self, name):
        return self.fs.exists({'filename': name})

    def size(self, name):
        return self.fs.get_last_version(filename=name).length
GRIDFS_HOST, GRIDFS_PORT and GRIDFS_COLLECTION can be defined in your settings or passed as host, port, collection keyword arguments to GridFSStorage in your model's FileField.
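For instance (all values below are placeholders):

# settings.py
GRIDFS_HOST = 'localhost'
GRIDFS_PORT = 27017
GRIDFS_COLLECTION = 'myfiles'

# or, equivalently, on the field itself:
content = models.FileField(
    upload_to="appsfiles",
    storage=GridFSStorage(host='localhost', port=27017, collection='myfiles'),
)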
I referred to Django's custom storage documentation and loosely followed this answer to a similar question.