Airflow: how to get a response from BigQuery output for data availability and, based on the result, kick off a task/subdag - python-2.7

The requirement is to kick off a DAG based on data availability in upstream/dependent tables.
A while-style condition should check data availability in the BigQuery tables for n iterations; if data is available, kick off the subdag/task, otherwise continue looping.
It would be great to see a clear example of how to use BigQueryOperator or BigQueryValueCheckOperator to execute a BigQuery query something like this:
{Code}
SELECT
  1
FROM
  [your_dataset.your_table]  -- placeholder table
WHERE
  datetime BETWEEN TIMESTAMP(CURRENT_DATE())
    AND TIMESTAMP(DATE_ADD(CURRENT_DATE(), 1, 'day'))
LIMIT
  1
{Code}
If the query output is 1 (meaning data is available for today's load), then kick off the DAG; otherwise continue looping, as shown in the attached diagram link.
Has anyone set up such a design in an Airflow DAG?

You may check BaseSensorOperator and BigQueryTableSensor to implement your own sensor for this: https://airflow.incubator.apache.org/_modules/airflow/operators/sensors.html
Sensor operators keep executing at a time interval and succeed when a criterion is met, and fail if and when they time out.
BigQueryTableSensor only checks whether the table exists; it does not check the data in the table. The flow might look something like this:
task1 >> YourSensor >> task2
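A minimal sketch of such a sensor, assuming an Airflow 1.x install with the contrib BigQuery hook and a GCP connection named bigquery_default (the DAG name, schedule, placeholder table, and timings are illustrative, not from the question):

from datetime import datetime

from airflow import DAG
from airflow.contrib.hooks.bigquery_hook import BigQueryHook
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.sensors import BaseSensorOperator
from airflow.utils.decorators import apply_defaults

AVAILABILITY_SQL = """
SELECT 1
FROM [your_dataset.your_table]
WHERE datetime BETWEEN TIMESTAMP(CURRENT_DATE())
  AND TIMESTAMP(DATE_ADD(CURRENT_DATE(), 1, 'day'))
LIMIT 1
"""  # [your_dataset.your_table] is a placeholder; legacy SQL syntax as in the question


class BigQueryDataAvailabilitySensor(BaseSensorOperator):
    """Succeeds once the given SQL returns a truthy first value."""

    @apply_defaults
    def __init__(self, sql, bigquery_conn_id='bigquery_default', *args, **kwargs):
        super(BigQueryDataAvailabilitySensor, self).__init__(*args, **kwargs)
        self.sql = sql
        self.bigquery_conn_id = bigquery_conn_id

    def poke(self, context):
        hook = BigQueryHook(bigquery_conn_id=self.bigquery_conn_id)
        record = hook.get_first(self.sql)  # same DB-API style call BigQueryCheckOperator uses
        return bool(record) and bool(record[0])


dag = DAG('bq_availability_example', start_date=datetime(2018, 1, 1),
          schedule_interval='@daily')

task1 = DummyOperator(task_id='task1', dag=dag)
task2 = DummyOperator(task_id='task2', dag=dag)

wait_for_data = BigQueryDataAvailabilitySensor(
    task_id='wait_for_todays_data',
    sql=AVAILABILITY_SQL,
    poke_interval=600,       # re-check every 10 minutes
    timeout=6 * 60 * 60,     # fail if data never arrives within 6 hours
    dag=dag,
)

task1 >> wait_for_data >> task2

The poke_interval/timeout pair replaces the hand-rolled while loop: the sensor re-runs the availability query on a schedule and fails the task if data never shows up.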

Related

Q: AWS Redshift: ANALYZE COMPRESSION 'Table Name' - How to save the result set into a table / Join to other Table

I am looking for a way to save the result set of ANALYZE COMPRESSION to a table, or to join it to another table, in order to automate compression scripts.
Is it possible, and how?
You can always run ANALYZE COMPRESSION from an external program (a bash script is my go-to), read the results, and store them back in Redshift with inserts. This is usually the easiest and fastest approach when I run into this type of "no route from leader to compute node" issue on Redshift. These are often one-off scripts that don't need automation or support.
If I need something programmatic, I'll usually write a Lambda function (or possibly a Python program on an EC2 instance). It is fairly easy and execution speed is high, but it does require an external tool, and some users are not happy running things outside of the database.
If it needs to be completely internal to Redshift, I write a procedure that keeps the results of the leader-only query in a cursor and then loops over the cursor, inserting the data into a table. Basically the same as reading it out and inserting it back in, but the data never leaves Redshift. This isn't too difficult but is slow to execute: looping over a cursor and inserting one row at a time is not efficient. The last one of these I did took 25 seconds for 1,000 rows. That was fast enough for the application, but if you need to do this on 100,000 rows you will be waiting a while. I've never done this with ANALYZE COMPRESSION before, so there could be some issue, but it is definitely worth a shot if this needs to be SQL-initiated.
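For the external-program route, here is a rough Python sketch (psycopg2 is assumed, and the cluster endpoint, credentials, and the pre-created compression_results table are hypothetical):

import psycopg2

# Connection details are placeholders for your own cluster.
conn = psycopg2.connect(host='my-cluster.example.com', port=5439,
                        dbname='mydb', user='admin', password='secret')
conn.autocommit = True  # ANALYZE COMPRESSION cannot run inside a transaction block
cur = conn.cursor()

# Leader-node command: run it and pull the result set back to the client.
cur.execute("ANALYZE COMPRESSION my_schema.my_table")
rows = cur.fetchall()  # columns: table, column, encoding, est_reduction_pct

# Store the recommendations back in Redshift with ordinary inserts.
for table_name, column_name, encoding, reduction in rows:
    cur.execute(
        "INSERT INTO compression_results "
        "(table_name, column_name, encoding, est_reduction_pct) "
        "VALUES (%s, %s, %s, %s)",
        (table_name, column_name, encoding, reduction),
    )

cur.close()
conn.close()

The same logic ports almost directly to a Lambda function; only the connection handling and driver packaging change.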

Joining Solution using Co-Group by SideInput Apache Beam

I have two tables to join; it is a left join. Below are the two scenarios showing how my pipeline behaves.
The job runs in batch mode, it is all user data, and we want to process it in Google Dataflow.
Day 1:
Table A: 5000000 Records. (Size 3TB)
Table B: 200 Records. (Size 1GB)
Both tables were joined through a side input, where Table B's data was taken as the side input, and it was working fine.
Day 2:
Table A: 5000010 Records. (Size 3.001TB)
Table B: 20000 Records. (Size 100GB)
On the second day my pipeline slowed down because the side input uses a cache, and the cache was exhausted once the size of Table B increased.
So I tried using CoGroupByKey, but then Day 1 processing was quite slow, with a log message about having 10,000-plus values on a single key.
Is there a more performant way to perform the join when hot keys are introduced?
It is true that performance can drop precipitously once Table B no longer fits into the cache, and there aren't many good solutions. The slowdown with CoGroupByKey is not solely due to having many values on a single key, but also due to the fact that you're now shuffling (aka grouping) Table A at all (which was avoided when using a side input).
Depending on the distribution of your keys, one possible mitigation is to route your hot keys through a path that does the side-input join as before, and your long-tail keys through a CoGBK. This could be done by producing a truncated TableB' as a side input; your ParDo would attempt to look up the key, emitting to one PCollection if it was found in TableB' and to another if it was not [1]. One would then pass this second PCollection to a CoGroupByKey with all of TableB, and flatten the results.
[1] https://beam.apache.org/documentation/programming-guide/#additional-outputs
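A rough sketch of that split in the Beam Python SDK (the Create inputs stand in for your real sources, and deciding which keys go into the truncated TableB' is left to your own hot-key detection):

import apache_beam as beam
from apache_beam.pvalue import AsDict, TaggedOutput


class SplitAndJoinHot(beam.DoFn):
    """Join hot keys against the truncated side input; tag the rest for the CoGBK path."""

    def process(self, element, hot_b):
        key, a_value = element
        if key in hot_b:
            # Hot key: join immediately via the small side input, no shuffle of this row.
            yield TaggedOutput('joined', (key, (a_value, hot_b[key])))
        else:
            # Long-tail key: send it on to the CoGroupByKey path.
            yield TaggedOutput('cold', element)


def expand_cogbk(kv):
    key, grouped = kv
    b_values = list(grouped['b']) or [None]  # left join: keep A rows with no B match
    for a_value in grouped['a']:
        for b_value in b_values:
            yield (key, (a_value, b_value))


with beam.Pipeline() as p:
    table_a = p | 'ReadA' >> beam.Create([('hot', 'a1'), ('hot', 'a2'), ('rare', 'a3')])
    table_b = p | 'ReadB' >> beam.Create([('hot', 'b1'), ('rare', 'b2')])
    hot_b = p | 'HotB' >> beam.Create([('hot', 'b1')])  # truncated TableB' (hot keys only)

    split = table_a | 'SplitHot' >> beam.ParDo(
        SplitAndJoinHot(), hot_b=AsDict(hot_b)).with_outputs('joined', 'cold')

    cold_joined = (
        {'a': split.cold, 'b': table_b}
        | 'CoGBK' >> beam.CoGroupByKey()
        | 'ExpandJoin' >> beam.FlatMap(expand_cogbk)
    )

    # Flatten the side-input-joined hot keys with the CoGBK-joined long tail.
    joined = (split.joined, cold_joined) | 'Flatten' >> beam.Flatten()

If TableB' can carry several rows per key, AsMultiMap could replace AsDict in the side input.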

BigQuery - Inequality vs Equality Joins - maximumBillingTier

Here is the inequality condition that I have in my join (simple overlap conditions):
ON
(A.start <= B.End) AND (B.Start <= A.END)
It gives me the following error:
java.lang.RuntimeException:
BigQueryError{reason=billingTierLimitExceeded, location=null,
message=Query exceeded resource limits. 700920.3330645757 CPU seconds
were used, and this query must use less than 529900.0 CPU seconds.
Surprisingly, this operation takes longer than running the sequential algorithm (without any join) on a single instance (n1-highmem-16).
I have a couple of questions:
1) How can I calculate maximumBillingTier for my query?
2) Can someone explain how inequality joins work in BigQuery?
3) Why are inequality joins so expensive?
Is it because of the number of operations, or because of the large number of outputs?
For the same query and input tables, the inequality join takes more than 13,000 seconds and is eventually canceled due to a time-out, but if I change the condition to cover only equality, it takes only 70 seconds.
Thanks!
1) How can I calculate maximumBillingTier for my query?
I think this comes down to the notion of slots:
A BigQuery slot is a unit of computational capacity required to execute SQL queries. BigQuery automatically calculates how many slots are required by each query, depending on query size and complexity.
The default number of slots for on-demand queries is shared among all queries in a single project. As a rule, if you're processing less than 100 GB of queries at once, you're unlikely to be using all 2,000 slots.
To check how many slots you're using, see Monitoring BigQuery Using Stackdriver.
See more details at Query Jobs Quotas
2) Can someone explain how inequality joins work in BigQuery?
This really depends on data size and distribution.
I would recommend Query Plan Explanation - it can help not only in understanding what is going on under the hood, but also in optimizing your query.
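If you prefer to inspect the plan and slot usage programmatically rather than in the UI, here is a sketch with the google-cloud-bigquery Python client (the table and column names are placeholders for your own):

from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT COUNT(*)
FROM `my_dataset.table_a` AS A
JOIN `my_dataset.table_b` AS B
  ON (A.`start` <= B.`end`) AND (B.`start` <= A.`end`)
"""

job = client.query(sql)
job.result()  # wait for the query to finish

print('bytes processed  :', job.total_bytes_processed)
print('slot-milliseconds:', job.slot_millis)

# Per-stage breakdown, roughly what the Query Plan Explanation UI shows.
for stage in job.query_plan:
    print(stage.name, stage.status,
          'read:', stage.records_read, 'written:', stage.records_written)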

How to change the delay from ASA to PowerBi

I have a simple query in ASA that reads from an IoT Hub input and sends an average calculation each second to Power BI. I can see that the first data arrives in Power BI 15-20 seconds after IoT Hub receives the input.
Is there anything I can do to decrease this delay?
Query:
SELECT AVG(CAST(acctotal as float)) as average_shake,
CAST(MAX(eventTime) as datetime) as time
INTO powerbioutput
FROM iothubinput
TIMESTAMP BY eventTime
GROUP BY TumblingWindow(second, 1)
Event Ordering settings are kept at the default values:
Late arrival: Days: 00, Hours: 00, Minutes: 00, Seconds: 05
Out of order: Minutes: 00, Seconds: 00
Action: Adjust
If you use the system timestamp instead of event time, I think you will see the delay go away. Try just removing the line "TIMESTAMP BY eventTime"
You can get system time - i.e. the timestamp given to the event as it flows through ASA - through:
SELECT System.Timestamp
As documented in MSDN.
Building onto Josh's response: perhaps you could try something like:
SELECT AVG(CAST(acctotal as float)) as average_shake,
System.Timestamp as time
INTO powerbioutput
FROM iothubinput
GROUP BY TumblingWindow(second, 1)
What is the volume of your input events, and what is the number of IoT Hub partitions? ASA merges data from IoT Hub partitions and arranges events by time to compute the aggregation defined in the query. If you have many partitions and a relatively small number of events, there can be additional delays, as some IoT Hub partitions may not have data and ASA will wait for data to appear (the maximum delay is controlled by the late arrival policy).
If this is the case, you may want to use fewer IoT Hub partitions.
In general, you will see lower latency in ASA when you process partitions in parallel (use the PARTITION BY clause). The drawback is that you will end up with partial aggregate values per partition; you can probably aggregate them further in Power BI.

How to find resource intensive and time consuming queries in WX2?

Is there a way to find resource-intensive and time-consuming queries in WX2?
I tried checking the SYS.IPE_COMMAND and SYS.IPE_TRANSACTION tables, but they were of no help.
The best way to identify such queries when they are still running is to connect as SYS with Kognitio Console and use Tools | Identify Problem Queries. This runs a number of queries against Kognitio virtual tables to understand how long current queries have been running, how much RAM they are using, etc. The most intensive queries are at the top of the list, ranked by the final column, "Relative Severity".
For queries which ran in the past, you can look in IPE_COMMAND to see duration but only for non-SELECT queries - this is because SELECT queries default to only logging the DECLARE CURSOR statement, which basically just measures compile time rather than run time. To see details for SELECT queries you should join to IPE_TRANSACTION to find the start and end time for the transaction.
For non-SELECT queries, IPE_COMMAND contains a breakdown of the time taken in a number of columns (all times in ms):
SM_TIME shows the compile time
TM_TIME shows the interpreter time
QUEUE_TIME shows the time the query was queued
TOTAL_TIME aggregates the above information
If it is for historic view image commands as mentioned in the comments, you can query
... SYS.IPE_COMMAND WHERE COMMAND IMATCHING 'create view image' AND TOTAL_TIME > 300000
If it is for currently running commands, you can look in SYS.IPE_CURTRANS and join to IPE_TRANSACTION to find the start time of the transaction (assuming your CVI runs in its own transaction; if not, you will need to look in IPE_COMMAND to find when the last statement in this TNO completed and use that as the start time).
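As a rough example of pulling that history out programmatically, here is a Python sketch over ODBC (pyodbc, the DSN name, and the credentials are assumptions; check the actual IPE_COMMAND column list on your system before relying on it):

import pyodbc

# Hypothetical DSN pointing at the Kognitio/WX2 system; credentials are placeholders.
conn = pyodbc.connect('DSN=wx2;UID=SYS;PWD=secret')
cur = conn.cursor()

# Rank historic non-SELECT commands by total time, with the timing breakdown
# described above (all times in ms); 300000 ms = 5 minutes.
cur.execute("""
    SELECT TNO, COMMAND, SM_TIME, TM_TIME, QUEUE_TIME, TOTAL_TIME
    FROM SYS.IPE_COMMAND
    WHERE TOTAL_TIME > 300000
    ORDER BY TOTAL_TIME DESC
""")

for row in cur.fetchmany(20):
    print(row)

cur.close()
conn.close()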