Airflow allows one to pass parameters to dags via command line or via an experimental rest api. For example:
airflow trigger_dag dag_id --conf '{"parameter":"~/path" }
A commonly referenced unit test example for an operator looks like this:
https://bcb.github.io/airflow/testing-dags
class TestMyOperator(TestCase):
def test_execute(self):
dag = DAG(dag_id='foo', start_date=datetime.now())
task = MyOperator(dag=dag, task_id='foo')
ti = TaskInstance(task=task, execution_date=datetime.now())
result = task.execute(ti.get_template_context())
self.assertEqual(result, 'foo')
How should one mock command line parameters in a unit test like the one above?
I think the best solution is to make the parameter a template
Edit your MyOperator to something along these lines:
class MyOperator(BaseOperator):
template_fields = ('parameter')
#apply_defaults
def __init__(self,
parameter,
*args,
**kwargs):
super(MyOperator, self).__init__(*args, **kwargs)
self.parameter = parameter
And in your DAG:
my_operator = MyOperator(dag=dag,
parameter="{{ dag_run.conf['parameter'] }}")
In your unittest just set the parameter
Unfortunately I have not tested this in a DAG, but according to various google searches this should work. On the plus side, this makes your MyOperator more independent and can be reused other places where the DAG is not instantiated through trigger_dag
I don't know if it is possible to set configurations during a test, as you would alter a dag_run. But you could browse through Airflows test-code and you might find something: https://github.com/apache/incubator-airflow/tree/master/tests
Related
according to documentation:
For custom management commands that use options not created using
parser.add_argument(), add a stealth_options attribute on the command:
class MyCommand(BaseCommand):
stealth_options = ('option_name', ...)
but why not just add these options to parser.add_argument()? Is there any profit to use stealth_options?
Mainly for testing purpose IMO, check this code snippet taken from this example.
def inspectdb_tables_only(table_name):
"""
Limit introspection to tables created for models of this app.
Some databases such as Oracle are extremely slow at introspection.
"""
return table_name.startswith('inspectdb_')
class InspectDBTestCase(TestCase):
unique_re = re.compile(r'.*unique_together = \((.+),\).*')
def test_stealth_table_name_filter_option(self):
out = StringIO()
call_command('inspectdb', table_name_filter=inspectdb_tables_only, stdout=out)
error_message = "inspectdb has examined a table that should have been filtered out."
self.assertNotIn("class DjangoContentType(models.Model):", out.getvalue(), msg=error_message)
Now we can use stealth_options to simplify the process of bypassing call_command() check and refine our test method while hiding these options from the exposed command API.
call_command() now validates that the argument parser of the command
being called defines all of the options passed to call_command().
I have the following module that I am trying to write unit tests for.
import myModuleWithCtxMgr
def myFunc(arg1):
with myModuleWithCtxMgr.ctxMgr() as ctxMgr:
result = ctxMgr.someFunc()
if result:
return True, result
return False, None
The unit tests I'm working on looks like this.
import mock
import unittest
import myModule as myModule
class MyUnitTests(unittest.TestCase):
#mock.patch("myModuleWithCtxMgr.ctxMgr")
def testMyFunc(self, mockFunc):
mockReturn = mock.MagicMock()
mockReturn.someFunc = mock.Mock(return_value="val")
mockFunc.return_value = mockReturn
result = myModule.myFunc("arg")
self.assertEqual(result, (True, "val"))
The test is failing because result[0] = magicMock() and not the return value (I thought) I configured.
I've tried a few different variations of the test but I can't seem to be able to mock the return value of ctxMgr.someFunc(). Does anyone know how I might accomplish this?
Thanks!
The error says:
First differing element 1:
<MagicMock name='ctxMgr().__enter__().someFunc()' id='139943278730000'>
'val'
- (True, <MagicMock name='ctxMgr().__enter__().someFunc()' id='139943278730000'>)
+ (True, 'val')
The error contains the mock name which exactly shows you what needs to be mocked. Note that __enter__ corresponds to the Context Manager protocol.
This works for me:
class MyUnitTests(unittest.TestCase):
#mock.patch("myModuleWithCtxMgr.ctxMgr")
def testMyFunc(self, mockCtxMgr):
mockCtxMgr().__enter__().someFunc.return_value = "val"
result = myModule.myFunc("arg")
self.assertEqual(result, (True, "val"))
Note how each of these is a separate MagicMock instance which you can configure:
mockCtxMgr
mockCtxMgr()
mockCtxMgr().__enter__
mockCtxMgr().__enter__()
mockCtxMgr().__enter__().someFunc
MagicMocks are created lazily but have identity, so you can configure them this way and it Just Works.
I am trying make a ancestor query like this example and transfer it to template version.
The problem is that the parameter ancestor_id is for the function make_query during pipeline construction.
If I don't pass it when create and stage the template, I will get RuntimeValueProviderError: RuntimeValueProvider(option: ancestor_id, type: int).get() not called from a runtime context. But if I pass it at template creating, it seems like a StaticValueProvider that never change when I execute the template.
What is the correct way to pass parameter to template for pipeline construction?
import apache_beam as beam
from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud.proto.datastore.v1 import entity_pb2
from google.cloud.proto.datastore.v1 import query_pb2
from googledatastore import helper as datastore_helper
from googledatastore import PropertyFilter
class Test(PipelineOptions):
#classmethod
def _add_argparse_args(cls, parser):
parser.add_value_provider_argument('--ancestor_id', type=int)
def make_query(ancestor_id):
ancestor = entity_pb2.Key()
datastore_helper.add_key_path(ancestor, KIND, ancestor_id)
query = query_pb2.Query()
datastore_helper.set_kind(query, KIND)
datastore_helper.set_property_filter(query.filter, '__key__', PropertyFilter.HAS_ANCESTOR, ancestor)
return query
pipeline_options = PipelineOptions()
test_options = pipeline_options.view_as(TestOptions)
with beam.Pipeline(options=pipline_options) as p:
entities = p | ReadFromDatastore(PROJECT_ID, make_query(test_options.ancestor_id.get()))
Two problems.
The ValueProvider.value.get() method can only run in a run-time method like ParDo.process(). See example.
Further, your challenge is that your are using Google Cloud Datastore IO (a query from datastore). As of today (May 2018),
the official documentation indicates that, Datastore IO is NOT accepting runtime template parameters yet.
For python, particularly,
The following connectors accept runtime parameters.
File-based IOs: textio, avroio, tfrecordio
A workaround: you probably can first run a query without any templated parameters to get a PCollection of entities. At this time, since any transformers can accept a templated parameter you might be able to use it as a filter. But this depends on your use case and it may not applicable to you.
My question is probably pretty basic but still I can't get a solution in the official doc. I have defined a Celery chain inside my Django application, performing a set of tasks dependent from eanch other:
chain( tasks.apply_fetching_decision.s(x, y),
tasks.retrieve_public_info.s(z, x, y),
tasks.public_adapter.s())()
Obviously the second and the third tasks need the output of the parent, that's why I used a chain.
Now the question: I need to programmatically revoke the 2nd and the 3rd tasks if a test condition in the 1st task fails. How to do it in a clean way? I know I can revoke the tasks of a chain from within the method where I have defined the chain (see thisquestion and this doc) but inside the first task I have no visibility of subsequent tasks nor of the chain itself.
Temporary solution
My current solution is to skip the computation inside the subsequent tasks based on result of the previous task:
#shared_task
def retrieve_public_info(result, x, y):
if not result:
return []
...
#shared_task
def public_adapter(result, z, x, y):
for r in result:
...
But this "workaround" has some flaw:
Adds unnecessary logic to each task (based on predecessor's result), compromising reuse
Still executes the subsequent tasks, with all the resulting overhead
I haven't played too much with passing references of the chain to tasks for fear of messing up things. I admit also I haven't tried Exception-throwing approach, because I think that the choice of not proceeding through the chain can be a functional (thus non exceptional) scenario...
Thanks for helping!
I think I found the answer to this issue: this seems the right way to proceed, indeed. I wonder why such common scenario is not documented anywhere, though.
For completeness I post the basic code snapshot:
#app.task(bind=True) # Note that we need bind=True for self to work
def task1(self, other_args):
#do_stuff
if end_chain:
self.request.callbacks[:] = []
....
Update
I implemented a more elegant way to cope with the issue and I want to share it with you. I am using a decorator called revoke_chain_authority, so that it can revoke automatically the chain without rewriting the code I previously described.
from functools import wraps
class RevokeChainRequested(Exception):
def __init__(self, return_value):
Exception.__init__(self, "")
# Now for your custom code...
self.return_value = return_value
def revoke_chain_authority(a_shared_task):
"""
#see: https://gist.github.com/bloudermilk/2173940
#param a_shared_task: a #shared_task(bind=True) celery function.
#return:
"""
#wraps(a_shared_task)
def inner(self, *args, **kwargs):
try:
return a_shared_task(self, *args, **kwargs)
except RevokeChainRequested, e:
# Drop subsequent tasks in chain (if not EAGER mode)
if self.request.callbacks:
self.request.callbacks[:] = []
return e.return_value
return inner
This decorator can be used on a shared task as follows:
#shared_task(bind=True)
#revoke_chain_authority
def apply_fetching_decision(self, latitude, longitude):
#...
if condition:
raise RevokeChainRequested(False)
Please note the use of #wraps. It is necessary to preserve the signature of the original function, otherwise this latter will be lost and celery will make a mess at calling the right wrapped task (e.g. it will call always the first registered function instead of the right one)
As of Celery 4.0, what I found to be working is to remove the remaining tasks from the current task instance's request using the statement:
self.request.chain = None
Let's say you have a chain of tasks a.s() | b.s() | c.s(). You can only access the self variable inside a task if you bind the task by passing bind=True as argument to the tasks' decorator.
#app.task(name='main.a', bind=True):
def a(self):
if something_happened:
self.request.chain = None
If something_happened is truthy, b and c wouldn't be executed.
I have several TestCase classes in my django application. On some of them, I mock out a function which calls external resources by decorating the class with #mock.patch, which works great. One TestCase in my test suite, let's call it B(), depends on that external resource so I don't want it mocked out and I don't add the decorator. It looks something like this:
#mock.patch("myapp.external_resource_function", new=mock.MagicMock)
class A(TestCase):
# tests here
class B(TestBase):
# tests here which depend on external_resource_function
When I test B independently, things work as expected. However, when I run both tests together, A runs first but the function is still mocked out in B. How can I unmock that call? I've tried reloading the module, but it didn't help.
Patch has start and stop methods. Based on what I can see from the code you have provided, I would remove the decorator and use the setUp and tearDown methods found in the link in your classes.
class A(TestCase):
def setUp(self):
self.patcher1 = patch('myapp.external_resource_function', new=mock.MagicMock)
self.MockClass1 = self.patcher1.start()
def tearDown(self):
self.patcher1.stop()
def test_something(self):
...
>>> A('test_something').run()
Great answer. With regard to Ethereal's question, patch objects are pretty flexible in their use.
Here's one way to approach tests that require different patches. You could still use setUp and tearDown, but not to do the patch.start/stop bit.
You start() the patches in each test and you use a finally clause to make sure they get stopped().
Patches also support Context Manager stuff so that's another option, not shown here.
class A(TestCase):
patcher1 = patch('myapp.external_resource_function', new=mock.MagicMock)
patcher2 = patch('myapp.something_else', new=mock.MagicMock)
def test_something(self):
li_patcher = [self.patcher1]
for patcher in li_patcher:
patcher.start()
try:
pass
finally:
for patcher in li_patcher:
patcher.stop()
def test_something_else(self):
li_patcher = [self.patcher1, self.patcher2]
for patcher in li_patcher:
patcher.start()
try:
pass
finally:
for patcher in li_patcher:
patcher.stop()