Is there something like RecipeOperator in Airflow?

I know that there is the PythonOperator in Airflow, but I was wondering if there is anything like RecipeOperator for running recipes in a DAG. If not, how could I create one?

There is no built-in RecipeOperator. However, you do have the ability to create your own operators with Airflow:
https://airflow.apache.org/docs/stable/howto/custom-operator.html
You could create operators for your commonly used recipes and then call them as you just described.
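For example, here is a minimal sketch of what such an operator could look like, following the custom-operator howto linked above. The run_recipe() helper is hypothetical and stands in for whatever actually executes one of your recipes:

from airflow.models.baseoperator import BaseOperator
from airflow.utils.decorators import apply_defaults

class RecipeOperator(BaseOperator):
    """Runs a single named recipe as a task in a DAG."""

    @apply_defaults
    def __init__(self, recipe_name, **kwargs):
        super().__init__(**kwargs)
        self.recipe_name = recipe_name

    def execute(self, context):
        # Hypothetical helper: replace with whatever actually runs a recipe.
        run_recipe(self.recipe_name)

In a DAG you would then use it like any other operator, e.g. RecipeOperator(task_id="daily_recipe", recipe_name="daily_report", dag=dag).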

Related

What options do I have in Amazon RDS for using 'fixed_date'?

Historically, in Oracle I've used the fixed_date parameter to change the system date when running a series of reports that tie together, to verify that those links are still correct.
Now that we've moved to Amazon RDS, that capability is not available.
What are my options?
I've considered changing all calls to 'system_date' to use a custom function that simulates this. (Ugh, this is hundreds of packages, but is possible)
Are there better options for using fixed_date?
It seems like the only option you have is to create a custom function and replace all the calls to system_date.
CREATE OR REPLACE FUNCTION fml.system_date
  RETURN DATE
AS
BEGIN
  RETURN TO_DATE('03-04-2021', 'DD-MM-YYYY');
END;
/
I'm not sure I would take this approach, but you could also investigate "stored outlines" if there are not too many queries involved, and have them call the alternate function/package instead. The fixed_date call will still fail, but maybe it can serve as a workaround. That outline could then be used only for the reports user, for example.
I am not sure why Amazon doesn't support something like this yet...

How to log more frequently than evaluating with `ray.tune.Trainable`

I am interested in using the Tune library for reinforcement learning and I would like to use the built-in TensorBoard capability. However, the metric that I am using to tune my hyperparameters is based on a time-consuming evaluation procedure that should be run infrequently.
According to the documentation, it looks like the _train method returns a dictionary that is used both for logging and for tuning hyperparameters. Is it possible to perform logging more frequently within the _train method? Alternatively, could I return the values that I wish to log from the _train method but some of the time omit the expensive-to-compute metric from the dictionary?
One option is to use your own logging mechanism in the Trainable. You can log to the trial-specific directory (Trainable.logdir). If this conflicts with the built-in TensorBoard logging, you can remove that by passing loggers=None to tune.run().
Another option is, as you mentioned, to omit the expensive-to-compute metric from the dictionary some of the time. If you run into issues with that, you can also return None as the value for the metrics you don't plan to compute in a particular iteration.
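Here is a rough sketch combining both options, assuming the older Trainable API from your question (_setup/_train) and a hypothetical model object with train_one_epoch() and expensive_evaluation() methods:

import os
from ray import tune

class MyTrainable(tune.Trainable):
    def _setup(self, config):
        self.eval_interval = config.get("eval_interval", 10)
        self.model = build_model(config)  # hypothetical: build your model/agent here
        self.n_calls = 0

    def _train(self):
        self.n_calls += 1
        train_loss = self.model.train_one_epoch()  # hypothetical cheap training step

        # Option 1: write your own, more frequent logs into the trial directory.
        with open(os.path.join(self.logdir, "extra_metrics.csv"), "a") as f:
            f.write("{},{}\n".format(self.n_calls, train_loss))

        result = {"train_loss": train_loss}

        # Option 2: only attach the expensive metric every eval_interval
        # iterations; report None for it otherwise.
        if self.n_calls % self.eval_interval == 0:
            result["eval_score"] = self.model.expensive_evaluation()  # hypothetical
        else:
            result["eval_score"] = None
        return result

You would then launch it with something like tune.run(MyTrainable, config={"eval_interval": 10}), adding loggers=None as described above if you want to drop the built-in loggers.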
Hope that helps!

AWS boto3 -- Difference between `batch_writer` and `batch_write_item`

I'm currently using boto3 with DynamoDB, and I noticed that there are two types of batch write:
batch_writer is used in the tutorial, and it seems like you can just iterate through different JSON objects to do inserts (this is just one example, of course).
batch_write_item seems to me to be a DynamoDB-specific function. However, I'm not 100% sure about this, and I'm not sure what the difference is between these two functions (performance, methodology, what not).
Do they do the same thing? If so, why are there two different functions? If not, what's the difference, and how do they compare in performance?
As far as I understand and use these APIs, with batch_write_item() you can even handle data for more than one table in one request, whereas with batch_writer() the actions you specify apply only to a single table. I think that is the most basic difference I can point out.
batch_writer creates a context manager for writing objects to Amazon DynamoDB in batch. The batch writer will automatically handle buffering and sending items in batches. In addition, the batch writer will also automatically handle any unprocessed items and resend them as needed. All you need to do is call put_item for any items you want to add, and delete_item for any items you want to delete. In addition, you can specify auto_dedup if the batch might contain duplicated requests and you want this writer to handle de-dup for you.
source
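Here is a short sketch of how the two are typically used; the table names, key names, and items are made up for illustration:

import boto3

# Resource-level batch_writer: scoped to a single table. It buffers
# put_item/delete_item calls, sends them in batches of up to 25 items,
# and retries any unprocessed items for you.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("my-table")  # hypothetical table with partition key "pk"
with table.batch_writer() as batch:
    for i in range(100):
        batch.put_item(Item={"pk": str(i), "value": i})

# Client-level batch_write_item: the raw DynamoDB API call. One request
# can target several tables, but you must build the (max 25 item) batches
# and retry UnprocessedItems yourself.
client = boto3.client("dynamodb")
response = client.batch_write_item(
    RequestItems={
        "my-table": [
            {"PutRequest": {"Item": {"pk": {"S": "a"}, "value": {"N": "1"}}}},
        ],
        "my-other-table": [
            {"DeleteRequest": {"Key": {"pk": {"S": "b"}}}},
        ],
    }
)
unprocessed = response.get("UnprocessedItems", {})  # the caller must resend these

Under the hood, batch_writer ends up calling the same BatchWriteItem operation; it is essentially a convenience wrapper around batch_write_item for the single-table case.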

Get application names of AT jobs from scheduled tasks

How can I get a list of all the AT scheduled tasks' application names? Specifically, I want to know how to do it using the NetScheduleJobGetInfo function and the AT_INFO structure.
I'm programming in C++.
Actually, I think you have the wrong function. You should be looking at NetScheduleJobEnum(), which will give you an array of AT_ENUM structs, one for each job. Inside each AT_ENUM is the command associated with the task.

Where can I find a HBase cascading module for hbase-0.89.20100924+28?

I am working on a project using MapReduce and HBase. We are using Cloudera's CDH3 distribution, which has hbase-0.89.20100924+28 bundled into it. I would like to use Cascading, as we have some processing that requires multiple MapReduce jobs, but I have been looking through the different forks of the HBase adaptors for Cascading on GitHub and can't seem to find one for our version of HBase. Could someone point me in the correct direction?
We are using https://github.com/ryanobjc/cascading.hbase with CDH3u1. If you have not tried that one with CDH3, give it a try.
This looks like a timestamped version coming from source control. Can't you just extract it from the tarball/jar file?