Say I have a DAG (with a dag_id and run_id) whose pipeline is T1 > T2 > T3, where T1, T2 and T3 are Python operators.
I want to be able to pass parameters from T1 into T2 without storing them in a database/S3 and reading them back into T2, as this is slow.
I know that if T1 fails then T2 won't execute, and looking into this I saw that the decision is based on the exit code (0/1), so there doesn't seem to be a way to pass parameters through.
Does anyone know if I can pass parameters/output to the next operator without reading/writing externally? Are there any examples of this? I could not find any.
As mentioned in the Airflow documentation, you can use XComs. XComs let tasks exchange messages, allowing more nuanced forms of control and shared state. The name is an abbreviation of “cross-communication”. XComs are principally defined by a key, value, and timestamp, but also track attributes like the task/DAG that created the XCom and when it should become visible. Any object that can be pickled can be used as an XCom value, so users should make sure to use objects of appropriate size. Check this link: https://airflow.apache.org/docs/stable/concepts.html?highlight=xcom#xcoms
Say T1 has task id t1 and calls a Python function func:
def func():
    return some_value
You can fetch the return value of func using XCom. Say T2 calls some Python function sample_function; then you can fetch func's return value as:
def sample_function(**context):
    value = context['task_instance'].xcom_pull(task_ids='t1')
Note that you need the context passed into your function; this is done by setting provide_context=True on the operator.
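Putting it together, a minimal sketch of such a DAG might look like this (the dag id, schedule and return value are placeholders, and it targets the pre-2.0 PythonOperator API used above):

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def func():
    # whatever is returned here is pushed to XCom under the key 'return_value'
    return {'some': 'value'}

def sample_function(**context):
    # pull the value that task 't1' pushed, without hand-rolled storage
    value = context['task_instance'].xcom_pull(task_ids='t1')
    print(value)

dag = DAG('example_dag', start_date=datetime(2020, 1, 1), schedule_interval=None)

t1 = PythonOperator(task_id='t1', python_callable=func, dag=dag)
t2 = PythonOperator(task_id='t2', python_callable=sample_function,
                    provide_context=True, dag=dag)
t1 >> t2

One caveat: XCom values are stored in Airflow's metadata database under the hood, so this avoids writing your own external storage rather than storage altogether, and it is best suited to small payloads.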
I think I have confused myself on how I should approach this.
I have a number of functions that I use to interact with an api, for example get product ID, update product detail, update inventory. These calls need to be done one after another, and are all wrapped up in one function api.push().
Let's say I need to run api.push() 100 times, once for each of 100 product IDs.
What I want to do is run many api.push calls at the same time, so that I can speed up the processing. For example, let's say I want to run 5 at a time.
I am confused as to whether this is multiprocessing or threading, or neither. I tried both, but they didn't seem to work; for example, I have this:
import multiprocessing

jobs = []
for n in range(0, 4):
    print("adding a job %s" % n)
    p = multiprocessing.Process(target=api.push())
    jobs.append(p)

# Starts the jobs
for job in jobs:
    job.start()
for job in jobs:
    job.join()
Any guidance would be appreciated
Thanks
Please read the Python docs and do some research on the global interpreter lock (GIL) to see whether you should use threading or multiprocessing in your situation: CPython threads do not run Python code in parallel, but they are usually fine for I/O-bound work such as API calls, whereas multiprocessing sidesteps the GIL at the cost of extra processes.
I do not know the inner workings of api.push, but please note that you should pass a function reference to multiprocessing.Process.
Using p = multiprocessing.Process(target=api.push()) will pass whatever api.push() returns as the function to be called in the subprocess.
If api.push is the function to be called in the subprocess, you should use p = multiprocessing.Process(target=api.push) instead, as this passes a reference to the function rather than a reference to the result of calling it.
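For the "100 IDs, 5 at a time" requirement, a worker pool handles the batching for you. A minimal sketch, assuming api.push accepts a product ID (adjust to its real signature):

import multiprocessing

def push_one(product_id):
    # each worker process performs one push;
    # 'api' is the wrapper module from the question
    api.push(product_id)

if __name__ == '__main__':
    product_ids = range(100)        # placeholder for the real 100 product IDs
    pool = multiprocessing.Pool(5)  # at most 5 pushes run concurrently
    pool.map(push_one, product_ids)
    pool.close()
    pool.join()

If the work is I/O-bound (mostly waiting on the API), multiprocessing.pool.ThreadPool offers the same interface using threads instead of processes.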
I am using Spotify's Luigi tool to handle dependencies between several jobs.
def requires(self):
    yield task1()
    info = retrieve_info()
    yield task2(info=info)
In my example, I'd like to require task1 and then retrieve some information that depends on the execution of task1, in order to pass it as an argument to task2. However, my function retrieve_info won't work, because task1 has not run yet.
My question is: since I am using yield, shouldn't task1 be processed before the call to retrieve_info is made? Or is Luigi iterating over the whole requires method first and only then launching the processing of the different tasks?
If this last assumption is right, how can I use the output of one required task as the input of a second required task?
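For reference, Luigi's way to express this is dynamic dependencies: requires() is evaluated in full before the task runs, but a task yielded from run() is scheduled on demand, after everything yielded before it has completed. A sketch reusing the names from the question (the surrounding task class is invented):

import luigi

class wrapper_task(luigi.Task):
    def requires(self):
        # static dependency: evaluated before anything runs
        return task1()

    def run(self):
        # dynamic dependency: at this point task1 has completed,
        # so information derived from it can parameterize task2
        info = retrieve_info()
        yield task2(info=info)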
In one of my applications I want to limit users to making only a specific number of document conversions each calendar month, and I want to notify them of the conversions they've made and the number of conversions they can still make in that calendar month.
So I do something like the following.
class CustomUser(models.Model):
    # user fields here

    def get_converted_docs(self):
        return self.document_set.filter(date__range=[start, end]).count()

    def remaining_docs(self):
        converted = self.get_converted_docs()
        return LIMIT - converted
Now, document conversion is done in the background using Celery, so there may be a situation where a conversion task is still pending; in that case the above methods would let a user make an extra conversion, because the pending task is not included in the count.
How can I get the number of tasks pending for a specific CustomUser object here?
Update
OK, so I tried the following:
from celery.task.control import inspect

def get_scheduled_tasks():
    tasks = []
    scheduled = inspect().scheduled()
    # each value is the list of tasks scheduled on one worker
    for worker_tasks in scheduled.values():
        tasks.extend(worker_tasks)
    return tasks
This gives me a list of scheduled tasks, but all the values are unicode; for the above-mentioned task the args look like this:
u'args': u'(<Document: test_document.doc>, <CustomUser: Test User>)'
Is there a way these can be decoded back to the original Django objects so that I can filter them?
Store the state of your documents somewhere else, don't inspect your queue.
Either create a separate model for that, or e.g. have a state on your document model, at least independently from your queue. This should have several advantages:
Inspecting the queue might be expensive, depending on the backend used for it, and as you have seen it can also turn out to be difficult.
Your queue might not be persistent: if e.g. your server crashes and you use something like Redis, you would lose this information, so it's a good thing to have a record somewhere else that lets you reconstruct the queue.
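A hedged sketch of the state-on-the-model idea, reusing names from the question (the status values and the pending/converted split are illustrative):

from django.db import models

class Document(models.Model):
    PENDING, CONVERTED = 'pending', 'converted'

    user = models.ForeignKey('CustomUser', on_delete=models.CASCADE)
    date = models.DateTimeField(auto_now_add=True)
    # set to PENDING when the celery task is queued; the task
    # flips it to CONVERTED once the conversion finishes
    status = models.CharField(max_length=10, default=PENDING)

Counting rows in either state then charges queued conversions against the quota immediately:

def get_converted_docs(self):
    # start and end as in the question
    return self.document_set.filter(
        date__range=[start, end],
        status__in=[Document.PENDING, Document.CONVERTED],
    ).count()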
I am using Sentry (in a django project), and I'd like to know how I can get the errors to aggregate properly. I am logging certain user actions as errors, so there is no underlying system exception, and am using the culprit attribute to set a friendly error name. The message is templated, and contains a common message ("User 'x' was unable to perform action because 'y'"), but is never exactly the same (different users, different conditions).
Sentry clearly uses some set of attributes under the hood to determine whether to aggregate errors as the same exception, but despite having looked through the code, I can't work out how.
Can anyone short-cut my having to dig further into the code and tell me what properties I need to set in order to manage aggregation as I would like?
[UPDATE 1: event grouping]
This line appears in sentry.models.Group:
class Group(MessageBase):
    """
    Aggregated message which summarizes a set of Events.
    """
    ...
    class Meta:
        unique_together = (('project', 'logger', 'culprit', 'checksum'),)
    ...
Which makes sense - project, logger and culprit I am setting at the moment - the problem is checksum. I will investigate further; however, 'checksum' suggests binary equivalence, which is never going to work - it must be possible to group instances of the same exception with different attributes?
[UPDATE 2: event checksums]
The event checksum comes from the sentry.manager.get_checksum_from_event method:
def get_checksum_from_event(event):
    for interface in event.interfaces.itervalues():
        result = interface.get_hash()
        if result:
            hash = hashlib.md5()
            for r in result:
                hash.update(to_string(r))
            return hash.hexdigest()
    return hashlib.md5(to_string(event.message)).hexdigest()
Next stop - where do the event interfaces come from?
[UPDATE 3: event interfaces]
I have worked out that interfaces refer to the standard mechanism for describing data passed into sentry events, and that I am using the standard sentry.interfaces.Message and sentry.interfaces.User interfaces.
Both of these will contain different data depending on the exception instance - and so a checksum will never match. Is there any way that I can exclude these from the checksum calculation? (Or at least the User interface value, as that has to be different - the Message interface value I could standardise.)
[UPDATE 4: solution]
Here are the two get_hash functions for the Message and User interfaces respectively:
# sentry.interfaces.Message
def get_hash(self):
    return [self.message]

# sentry.interfaces.User
def get_hash(self):
    return []
Looking at these two, only Message.get_hash will return a value that is picked up by the get_checksum_from_event method, and so this is the one that will be hashed and returned. The net effect is that the checksum is evaluated on the message alone - which in theory means that I can standardise the message and keep the user definition unique.
I've answered my own question here, but hopefully my investigation is of use to others having the same problem. (As an aside, I've also submitted a pull request against the Sentry documentation as part of this ;-))
(Note to anyone using / extending Sentry with custom interfaces - if you want to avoid your interface being used to group exceptions, return an empty list.)
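For example, a custom interface that should never influence grouping could be sketched like this (the interface name is hypothetical, and the exact import path of the Interface base class depends on your Sentry version):

class AuditInterface(Interface):
    """Hypothetical custom interface carrying per-event audit data."""

    def get_hash(self):
        # an empty list makes get_checksum_from_event skip this
        # interface, so it never contributes to grouping
        return []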
See my final update in the question itself. Events are aggregated on a combination of 'project', 'logger', 'culprit' and 'checksum' properties. The first three of these are relatively easy to control - the fourth, 'checksum' is a function of the type of data sent as part of the event.
Sentry uses the concept of 'interfaces' to control the structure of data passed in, and each interface comes with an implementation of get_hash, which is used to return a hash value for the data passed in. Sentry comes with a number of standard interfaces ('Message', 'User', 'HTTP', 'Stacktrace', 'Query', 'Exception'), and these each have their own implementation of get_hash. The default (inherited from the Interface base class) is an empty list, which would not affect the checksum.
In the absence of any valid interfaces, the event message itself is hashed and returned as the checksum, meaning that the message would need to be unique for the event to be grouped.
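In practice, "standardising the message" can look like logging a constant string and carrying the variable parts as extra data (a sketch; the function and field names are invented, and it assumes a Sentry/raven logging handler is configured):

import logging

logger = logging.getLogger(__name__)

def report_blocked_action(username, reason):
    # the constant string is all the Message interface hashes, so every
    # call collapses into one Sentry group; the variable details travel
    # as extra data, outside the checksum
    logger.error("User was unable to perform action",
                 extra={'username': username, 'reason': reason})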
I've had a common problem with exceptions. Our system is capturing only exceptions, and I was confused why some of these were merged into a single error while others were not.
With the information above I extracted the get_hash methods and tried to find the differences 'raising' my errors. What I found out is that the grouped errors all came from a self-written exception type that has an empty Exception.message value.
get_hash output:
[<class 'StorageException'>, StorageException()]
and the multiple errors came from an exception class that has a filled message value (the Jinja template engine):
[<class 'jinja2.exceptions.UndefinedError'>, UndefinedError('dict object has no attribute LISTza_*XYZ*',)]
Different exception messages trigger different reports; in my case the merge was caused by the lack of an Exception.message value.
Implementation:
class StorageException(Exception):
    def __init__(self, value):
        # Exception.__init__ is called with no arguments, so
        # Exception.message stays empty and all instances hash alike
        Exception.__init__(self)
        self.value = value
I made a class that has an asynchronous OpenWebPage() function. Once you call OpenWebPage(someUrl), a handler gets called - OnPageLoad(reply). I have been using a global variable called lastAction to take care of stuff once a page is loaded - the handler checks what lastAction is and calls an appropriate function. For example:
this->lastAction = "homepage";
this->OpenWebPage("http://www.hardwarebase.net");

void OnPageLoad(reply)
{
    if(this->lastAction == "homepage")
    {
        this->lastAction = "login";
        this->Login(); // POSTs a form and OnPageLoad gets called again
    }
    else if(this->lastAction == "login")
    {
        this->PostLogin(); // Checks did we log in properly, sets lastAction to "new topic" and goes to the new topic URL
    }
    else if(this->lastAction == "new topic")
    {
        this->WriteTopic(); // Does some more stuff ... you get the point
    }
}
Now, this is rather hard to write and keep track of when we have a large number of "actions". When I was doing stuff in Python (synchronously) it was much easier, like:
OpenWebPage("http://hardwarebase.net")  # stores the loaded page HTML in self.page
OpenWebPage("http://hardwarebase.net/login", {"user": username, "pw": password})  # POSTs a form
if self.page == ...:  # now do some more checks etc.
    # do something more
Imagine now that I have a queue class which holds the actions: homepage, login, new topic. How am I supposed to execute all those actions (in proper order, one after one!) via the asynchronous callback? The first example is totally hard-coded obviously.
I hope you understand my question, because frankly I fear this is the worst question ever written :x
P.S. All this is done in Qt.
You are inviting all manner of bugs if you try and use a single member variable to maintain state for an arbitrary number of asynchronous operations, which is what you describe above. There is no way for you to determine the order that the OpenWebPage calls complete, so there's also no way to associate the value of lastAction at any given time with any specific operation.
There are a number of ways to solve this (see the sketch after this list), e.g.:
Encapsulate web page loading in an immutable class that processes one page per instance
Return an object from OpenWebPage which tracks progress and stores the operation's state
Fire a signal when an operation completes and attach the operation's context to the signal
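Since the question mentions Python and a queue of actions, here is a minimal Python sketch along those lines, where each queued action carries its own completion handler instead of sharing a single lastAction variable (the browser object and its open_web_page callback API are invented stand-ins for the Qt loader):

class ActionQueue(object):
    """Runs queued actions strictly in order: each action starts only
    from the completion callback of the previous one."""

    def __init__(self, browser):
        self.browser = browser  # hypothetical async page loader
        self.actions = []

    def add(self, url, on_loaded):
        # on_loaded handles this specific page load, so no shared
        # lastAction variable is needed
        self.actions.append((url, on_loaded))

    def start(self):
        self._next()

    def _next(self):
        if not self.actions:
            return
        url, on_loaded = self.actions.pop(0)
        self.browser.open_web_page(
            url, callback=lambda reply: self._on_loaded(reply, on_loaded))

    def _on_loaded(self, reply, on_loaded):
        on_loaded(reply)  # per-action state lives in this closure
        self._next()      # only now does the next action begin

Usage would be queue.add(homepage_url, handle_homepage), queue.add(login_url, handle_login), then queue.start(); each handler can enqueue further actions before the next one fires.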
You need to add a "return" statement at the end of every "if" branch: as written, all the "if" branches are executed in the first OnPageLoad call.
Generally, asynchronous state management is always more complicated than synchronous. Consider replacing the lastAction string with an enumeration. Also, if the thread context of OnPageLoad is arbitrary, you need to synchronize access to global variables.
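A small Python sketch of the enumeration suggestion (method names mirror the question; Python's enum module stands in for whatever enum facility your language offers):

from enum import Enum

class Action(Enum):
    HOMEPAGE = 1
    LOGIN = 2
    NEW_TOPIC = 3

class Client(object):
    def on_page_load(self, reply):
        if self.last_action == Action.HOMEPAGE:
            self.last_action = Action.LOGIN
            self.login()
            return  # stop here; the next page load re-enters this handler
        if self.last_action == Action.LOGIN:
            self.post_login()
            return
        if self.last_action == Action.NEW_TOPIC:
            self.write_topic()
            return

An enum turns a typo like Action.NEW_TOPIK into an immediate AttributeError instead of a string comparison that silently never matches.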