WSO2 CEP: Can we have if-else conditional statements in execution plans?

Can we have conditional statements like the one below in a WSO2 CEP execution plan?
from stream1
select distinct attr1
insert into newStream1;
from stream2
select distinct attr2
insert into newStream2;
if
count(attr1) == count(attr2)
then
-- do something
else
-- do something else
Use case explained:
Let's say I have an execution plan that takes data from 3 different streams.
Stream 1 gives data from device 1, stream 2 from device 2, and so on.
I already have a database table that stores the total number of devices; in this case it stores 3 devices.
Now, in the execution plan, I collect data over a 5-minute window. Within those 5 minutes, the data should be processed only if data arrives from all 3 streams; otherwise it should not be.
If data arrives from only 2 streams within the 5-minute window, the execution plan should discard it.

You can use filters to implement this use case. Add a query whose filter expresses the 'if' condition, and then another query whose filter expresses the 'else' condition. You can use the outputs of these queries to do different types of processing separately, and you can chain queries together for more complex scenarios; a sketch follows below.
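A minimal SiddhiQL sketch of the filter pattern (the countStream definition and attribute names are assumptions, not from the original post):
-- Assumed input: a stream carrying the two counts computed upstream.
define stream countStream (count1 long, count2 long);
-- 'if' branch: this filter passes only events where the counts match.
from countStream[count1 == count2]
select count1, count2
insert into matchedStream;
-- 'else' branch: the complementary filter catches everything else.
from countStream[count1 != count2]
select count1, count2
insert into unmatchedStream;
Each branch's output stream can then feed its own downstream queries.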

Related

Athena ignores LIMIT in some queries

I have a table with a lot of partitions (something we're working on reducing).
When I query:
SELECT * FROM mytable LIMIT 10
I get:
"HIVE_EXCEEDED_PARTITION_LIMIT: Query over table 'mytable' can potentially read more than 1000000 partitions"
Why isn't the "LIMIT 10" part of the query sufficient for Athena to return a result without reading more than 1 or 3 partitions?
ANSWER:
During the query planning phase, Athena attempts to list all partitions potentially needed to answer the query.
Since Athena doesn't know which partitions actually contain data (i.e., which are non-empty), it adds all partitions to the list.
Athena plans a query and then executes it. During planning it lists the partitions and all the files in those partitions. However, it does not know anything about the files, how many records they contain, etc.
When you say LIMIT 10 you're telling Athena you want at most 10 records in the result, and since you don't have any grouping or ordering you want 10 arbitrary records.
However, during the planning phase Athena can't know which partitions have files in them, and how many of those files it will need to read to find 10 records. Without listing the partition locations it can't know they're not all empty, and without reading the files it can't know they're not all empty too.
Therefore Athena first has to get the list of partitions, then list each partition's location on S3, even if you say you only want 10 arbitrary records.
In this case there are so many partitions that Athena short-circuits and says that you probably didn't mean to run this kind of query. If the table had fewer partitions, Athena would execute the query and each worker would read as little as possible to return 10 records and then stop – but each worker would produce 10 records, because a worker can't assume that the other workers would return any. Finally, the coordinator picks 10 records out of all the results from all workers to return as the final result.
If I'm not wrong, LIMIT works on the display operation only, so the query will still read everything but only display 10 records.
Try limiting the data with a WHERE condition on the partition column (e.g. WHERE partition_column = '<some value>'); that should solve the issue.
I think Athena's workers try to read the maximum number of partitions (relative to the table's partition size) to get that random chunk of data, and stop when the query is fulfilled (which in your case means the LIMIT is satisfied).
In your case it doesn't even start executing that process, because too many partitions are involved. Therefore, if Athena won't plan your random data selection query, you have to plan it explicitly and hand it over to the execution engine.
Something like:
select * from mytable
where partition_column in (
select distinct partition_column from mytable limit 10
)
limit 100

Dividing tasks across AWS Step Functions and then joining them back when all are completed

We have an AWS Step Function that processes CSV files. These CSV files can contain anywhere from 1 to 4000 records.
Now, I want to create another, inner AWS Step Function that will process these CSV records. The problem is that for each record I need to hit another API, and for that I want all of the records to be processed asynchronously.
For example: a CSV is received with 2500 records.
The step function calls another step function 2500 times (the inner step function takes a CSV record as input), processes each record, and then stores the result in DynamoDB or some other place.
I have learnt about the callback pattern in AWS Step Functions, but in my case I would be passing 2500 tokens, and I want the outer step function to continue only when all 2500 records are done processing.
So my question is: is this possible using AWS Step Functions?
If you know any article or guide for me to reference then that would be great.
Thanks in advance
It sounds like dynamic parallelism could work:
To configure a Map state, you define an Iterator, which is a complete sub-workflow. When a Step Functions execution enters a Map state, it will iterate over a JSON array in the state input. For each item, the Map state will execute one sub-workflow, potentially in parallel. When all sub-workflow executions complete, the Map state will return an array containing the output for each item processed by the Iterator.
This keeps the flow all within a single Step Function and allows for easier traceability.
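As a rough illustration, the Map state could look something like this in Amazon States Language, shown here as a Python dict (the state names, ItemsPath, and Lambda ARN are hypothetical placeholders):

# Sketch of a state machine using a Map state for dynamic parallelism.
state_machine = {
    "StartAt": "ProcessAllRecords",
    "States": {
        "ProcessAllRecords": {
            "Type": "Map",
            "ItemsPath": "$.csvRecords",   # JSON array of parsed CSV records
            "MaxConcurrency": 40,          # throttle concurrent iterations
            "Iterator": {                  # a complete sub-workflow, run once per record
                "StartAt": "ProcessRecord",
                "States": {
                    "ProcessRecord": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-record",
                        "End": True
                    }
                }
            },
            "End": True   # the Map state returns an array of all iteration outputs
        }
    }
}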
The limiting factor would be the amount of concurrency available (docs):
Concurrent iterations may be limited. When this occurs, some iterations will not begin until previous iterations have completed. The likelihood of this occurring increases when your input array has more than 40 items.
One additional thing to be aware of here is cost. You'll easily blow right through the free tier and start incurring actual cost (link).

Joining solution using CoGroupByKey or side input in Apache Beam

I have 2 tables to join; it's a left join. Below are the two scenarios showing how my pipeline behaves.
The job runs in batch mode, it's all user data, and we want to process it on Google Dataflow.
Day 1:
Table A: 5000000 Records. (Size 3TB)
Table B: 200 Records. (Size 1GB)
Both tables were joined with TableB taken as a side input, and it was working fine.
Day 2:
Table A: 5000010 Records. (Size 3.001TB)
Table B: 20000 Records. (Size 100GB)
On the second day my pipeline slowed down, because side inputs rely on a cache and my cache was exhausted once the size of TableB increased.
So I tried using CoGroupByKey, but Day 1 data processing was pretty slow, with a log message about having 10000+ values on a single key.
So is there a better-performing way to do the join when hot keys are introduced?
It is true that the performance can drop precipitously once table B no longer fits into cache, and there aren't many good solutions. The slowdown in using CoGroupByKey is not solely due to having many values on a single key, but also the fact that you're now shuffling (aka grouping) Table A at all (which was avoided when using a side input).
Depending on the distribution of your keys, one possible mitigation could be to route your hot keys into a path that does the side-input joining as before, and your long-tail keys into a CoGBK. This could be done by producing a truncated TableB' as a side input; your ParDo would attempt to look up the key, emitting to one PCollection if it was found in TableB' and to another if it was not [1]. One would then pass this second PCollection to a CoGroupByKey with all of TableB, and flatten the results.
[1] https://beam.apache.org/documentation/programming-guide/#additional-outputs
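A rough Beam Python sketch of that split, under the assumption that TableA and TableB are keyed PCollections and the truncated TableB' fits in memory as a dict side input (all names here are hypothetical):

import apache_beam as beam

class SplitJoin(beam.DoFn):
    # Joins hot keys against the side input; routes long-tail keys to 'cold'.
    def process(self, element, hot_b):
        key, a_value = element
        if key in hot_b:
            yield (key, (a_value, hot_b[key]))               # hot key: side-input join
        else:
            yield beam.pvalue.TaggedOutput('cold', element)  # long tail: defer to CoGBK

with beam.Pipeline() as p:
    table_a = p | 'ReadA' >> beam.Create([('k1', 'a1'), ('k2', 'a2')])
    table_b = p | 'ReadB' >> beam.Create([('k1', 'b1'), ('k2', 'b2')])
    hot_b   = p | 'ReadHotB' >> beam.Create([('k1', 'b1')])  # truncated TableB'

    split = table_a | 'Split' >> beam.ParDo(
        SplitJoin(), hot_b=beam.pvalue.AsDict(hot_b)
    ).with_outputs('cold', main='hot')

    # Long-tail path: CoGroupByKey against all of TableB, expanded as a left join.
    cold_joined = (
        {'a': split.cold, 'b': table_b}
        | 'CoGBK' >> beam.CoGroupByKey()
        | 'Expand' >> beam.FlatMap(
            lambda kv: [(kv[0], (a, b))
                        for a in kv[1]['a']
                        for b in (list(kv[1]['b']) or [None])])
    )

    merged = (split.hot, cold_joined) | 'Merge' >> beam.Flatten()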

How to get only the latest row from a window

I am working with Kinesis Analytics and I am trying to understand how to write my application to give me a sliding window over 24 hours. What I have generates the right data, but it looks like it regenerates it every time; that might be what it's supposed to do, and my own ignorance may be preventing me from looking at the problem the right way.
What I want to do:
I have a few devices that feed a Kinesis Stream, which this Kinesis analytics application is hooked up to.
Now, when a record comes in, what I want to do is SUM a value over the last 24 hours and store that. So after Kinesis Analytics does its job, I'm connecting it to a Lambda to finalize some things.
My issue is that when I simulate sending in some data (5 records in this case), everything runs, but it runs multiple times, not 5 times. It LOOKS like each time a record comes in it redoes everything in the window (expected), which triggers the Lambda for each row that's emitted. As the table grows, that's bad news. What I really want is just the latest value from the NOW - 24 HOUR window, together with the "id" field, so I can join that "id" back to a record stored elsewhere.
My Application looks like this:
CREATE OR REPLACE STREAM "DEVICE_STREAM" (
"id" VARCHAR(64),
"timestamp_mark" TIMESTAMP,
"device_id" VARCHAR(64),
"property_a_id" VARCHAR(64),
"property_b_id" VARCHAR(64),
"value" DECIMAL
);
CREATE OR REPLACE PUMP "DEVICE_PUMP" AS
INSERT INTO "DEVICE_STREAM"
SELECT STREAM "id",
"timestamp_mark",
"device_id",
"x_id",
"y_id",
SUM("value") OVER W1 AS "value",
FROM "SOURCE_SQL_STREAM_001"
WINDOW W1 AS (
PARTITION BY "device_id", "property_a_id", "property_b_id" ORDER BY "SOURCE_SQL_STREAM_001".ROWTIME
RANGE INTERVAL '24' HOUR PRECEDING
);
Hmmm... this might be a better idea: do the aggregation in a sub-select and select from that. It looks like I need that second window (W2 below) to ensure I get each record that was given back out.
CREATE OR REPLACE STREAM "DEVICE_STREAM" (
"id" VARCHAR(64),
"timestamp_mark" TIMESTAMP,
"device_id" VARCHAR(64),
"property_a_id" VARCHAR(64),
"property_b_id" VARCHAR(64),
"value" DECIMAL
);
CREATE OR REPLACE PUMP "DEVICE_PUMP" AS
INSERT INTO "DEVICE_STREAM"
SELECT STREAM s."id",
s."timestamp_mark",
s."device_id",
s."property_a_id",
s."property_b_id",
v."value"
FROM "SOURCE_SQL_STREAM_001" OVER W2 AS s, (
SELECT STREAM "SOURCE_SQL_STREAM_001"."ROWTIME", "id",
"timestamp_mark",
"device_id",
"property_a_id",
"property_b_id",
SUM("value") OVER W1 AS "value",
FROM "SOURCE_SQL_STREAM_001"
WINDOW W1 AS (
PARTITION BY "device_id", "property_a_id", "property_b_id" ORDER BY "SOURCE_SQL_STREAM_001".ROWTIME
RANGE INTERVAL '24' HOUR PRECEDING
)
) AS v
WHERE s."id" = v."id"
WINDOW W2 AS (
RANGE INTERVAL '1' SECOND PRECEDING
);
Also, I notice that if I restart the Kinesis Analytics application the SUM values reset, so clearly they don't persist across restarts, which might make it unsuitable for this solution. I might have to just set up a SQL server and periodically delete old records.
In general using Streaming Analytics solutions (and Kinesis Analytics in particular) is recommended when you need to do something based on the data in the events and not something external like wall clock time.
The reason is simple: if you need to do something once every 24h, you create a job that brings the data from storage (a DB) once, performs your task, and then "goes to sleep" for another 24h – no complexities, manageable overhead. But if you need to do something based on the data (e.g. when the SUM of some field across multiple events exceeds X), you are in trouble with a conventional solution, since there is no simple criterion for when it should run. If you run it periodically, it might be invoked many times before the data-driven criterion is met, creating clear overhead.
In the latter case a Streaming Analytics solution will be used as designed, triggering your logic just when needed and minimizing the overhead.
If you prefer using Streaming Analytics (which I personally don't recommend based on the description of your problem) but are struggling with the Kinesis Analytics syntax, you might consider using Drools Kinesis Analytics. Among its features are crons and collectors, which provide a very simple way to trigger jobs on a time basis.
Note that my answer is biased, since I'm the CTO at Streamx.

Airflow: how to get a response from BigQuery output for data availability and, based on the result, kick off tasks/subdags

The requirement is to kick off a DAG based on data availability in upstream/dependent tables.
A while-style condition should check data availability (in the BigQuery tables, for n iterations). If data is available, then kick off the subdag/task; else, continue looping.
It would be great to see a clear example of how to use BigQueryOperator or BigQueryValueCheckOperator to execute a BigQuery query, something like this:
SELECT 1
FROM
WHERE datetime BETWEEN TIMESTAMP(CURRENT_DATE())
AND TIMESTAMP(DATE_ADD(CURRENT_DATE(), 1, 'day'))
LIMIT 1
If the query output is 1 (meaning data is available for today's load), then kick off the DAG; else continue in the loop, as shown in the attached diagram link.
Has anyone set up such a design in an Airflow DAG?
You may check the BaseSensorOperator and BigQueryTableSensor to implement your own Sensor for it. https://airflow.incubator.apache.org/_modules/airflow/operators/sensors.html
Sensor operators keep executing at a time interval and succeed when a
criteria is met and fail if and when they time out.
BigQueryTableSensor just checks whether the table exists or not, but does not check the data in the table. A custom sensor could be wired in like this:
task1 >> YourSensor >> task2
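A minimal sketch of such a custom sensor, assuming the contrib-era import paths from the Airflow version linked above (the class name, SQL, and connection ID are hypothetical):

from airflow.operators.sensors import BaseSensorOperator
from airflow.contrib.hooks.bigquery_hook import BigQueryHook
from airflow.utils.decorators import apply_defaults

class BigQueryDataAvailabilitySensor(BaseSensorOperator):
    # Succeeds once the given BigQuery query returns at least one row.
    @apply_defaults
    def __init__(self, sql, bigquery_conn_id='bigquery_default', *args, **kwargs):
        super(BigQueryDataAvailabilitySensor, self).__init__(*args, **kwargs)
        self.sql = sql
        self.bigquery_conn_id = bigquery_conn_id

    def poke(self, context):
        hook = BigQueryHook(bigquery_conn_id=self.bigquery_conn_id)
        cursor = hook.get_conn().cursor()
        cursor.execute(self.sql)
        # True ends the sensor successfully; False retries after poke_interval,
        # and the sensor fails once its timeout is reached.
        return cursor.fetchone() is not None

The sensor would then be instantiated with, for example, poke_interval=300 and a suitable timeout, and placed between task1 and task2 as shown above.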