Azure Stream Analytics join across two Event Hub inputs - azure-eventhub

I have two Event Hub inputs (Event-A and Event-B) to Azure Stream Analytics.
Input Event-A: Primary (whenever I get an event from 'event-A' I have to join it with the 'event-B' data from the last 20 minutes)
Input Event-B: Secondary (it's a kind of reference data, but coming from another Azure Event Hub)
Select a.id,b.id,b.action into outputevent from eventA a
left join eventB b on a.id = b.id -- don't know how to consider last 20 minutes event-B data
I need to match/join with Event-B data from the last 20 minutes only, and I don't know which window function applies to this use case. (I observed that most of the window functions wait for future events to trigger, but my requirement is to work with past 'event-B' data when I receive an event-A.)

I have formatted your own answer so it's more readable for others:
select
    b.eventenqueuedutctime as btime,
    a.Id,
    a.SysTime,
    a.UTCTime,
    b.Id as BId,
    b.SysTime as BSysTime
into outputStorage -- to blob storage (container)
from eventA a TIMESTAMP BY eventenqueuedutctime
left outer join eventB b TIMESTAMP BY eventenqueuedutctime
    on a.id = b.id
    and datediff(minute, b, a) between 0 and 180 -- join with the last 3 hours of eventB data
You posted this answer in 2020. Is this query still in use today?
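For the 20-minute duration from the question, the same DATEDIFF-bounded join pattern applies; a minimal sketch, assuming the column names shown above:
select
    a.id,
    b.id as bid,
    b.action
into outputevent
from eventA a TIMESTAMP BY eventenqueuedutctime
left outer join eventB b TIMESTAMP BY eventenqueuedutctime
    on a.id = b.id
    and datediff(minute, b, a) between 0 and 20 -- only eventB rows from the 20 minutes before each eventA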

Related

Redshift deadlock on COPY command

I'm doing a simple COPY command that used to work:
echo " COPY table_name
FROM 's3://bucket/<date>/'
iam_role 'arn:aws:iam::123:role/copy-iam'
format as json 's3://bucket/jupath.json'
gzip ACCEPTINVCHARS ' ' TRUNCATECOLUMNS TRIMBLANKS MAXERROR 3;
" | psql
And now I get:
INFO: Load into table 'table_name' completed, 53465077 record(s) loaded successfully.
ERROR: deadlock detected
DETAIL: Process 26999 waits for AccessExclusiveLock on relation 3176337 of database 108036; blocked by process 26835.
Process 26835 waits for ShareLock on transaction 24230722; blocked by process 26999.
The only change is moving from the dc2 instance type to ra3. Let me add that this is the only command that touches this table and there is only one process running at a time.
The key detail here is in the error message:
Process 26999 waits for AccessExclusiveLock on relation 3176337 of
database 108036; blocked by process 26835. Process 26835 waits for
ShareLock on transaction 24230722; blocked by process 26999.
Relation 3176337, I assume, is the table in question - the target of the COPY. This should be confirmed by running something like:
select distinct(id) table_id
,trim(datname) db_name
,trim(nspname) schema_name
,trim(relname) table_name
from stv_tbl_perm
join pg_class on pg_class.oid = stv_tbl_perm.id
join pg_namespace on pg_namespace.oid = relnamespace
join pg_database on pg_database.oid = stv_tbl_perm.db_id
;
I don't expect any surprises here but it is good to check. If it is some different table (object) then this is important to know.
Now for the meat. You have 2 processes listed in the error message - PID 26999 and PID 26835. A process is a unique connection to the database, or a session. So these are identifying the 2 connections to the database that have gotten locked with each other. A good next step is to see what each of these sessions (processes, or PIDs) is doing. Like this:
select xid, pid, starttime,
       max(datediff('sec',starttime,endtime)) as runtime,
       type,
       listagg(regexp_replace(text,'\\\\n*',' ')) WITHIN GROUP (ORDER BY sequence) || ';' as querytext
from svl_statementtext
where pid in (26999, 26835)
--where xid = 16627013
and sequence < 320
--and starttime > getdate() - interval '24 hours'
group by starttime, 1, 2, "type"
order by starttime, 1 asc, "type" desc;
The thing you might run into is that these logging tables "recycle" every few days, so the data from this exact failure might be lost.
The next part of the error is about the open transaction that is preventing 26835 from moving forward. This transaction (identified by an XID, or transaction ID) is preventing 26835 from progressing and is part of process 26999, but 26999 needs 26835 to complete some action before it can move - a deadlock. So seeing what is in this transaction will be helpful as well:
select xid, pid, starttime,
       max(datediff('sec',starttime,endtime)) as runtime,
       type,
       listagg(regexp_replace(text,'\\\\n*',' ')) WITHIN GROUP (ORDER BY sequence) || ';' as querytext
from svl_statementtext
where xid = 24230722
and sequence < 320
--and starttime > getdate() - interval '24 hours'
group by starttime, 1, 2, "type"
order by starttime, 1 asc, "type" desc;
Again, the data may have been lost due to time. I commented out the date-range WHERE clause of the last 2 queries to allow for looking back further in these tables. You should also be aware that PID and XID numbers are reused, so check the date stamps on the results to be sure that info from different sessions isn't being combined. You may need a new WHERE clause to focus in on just the event you care about.
Now you should have all the info you need to see why this deadlock is happening. Use the timestamps of the statements to see the order in which statements are being issued by each session (process). Remember that every transaction ends with a COMMIT (or ROLLBACK) and this will change the XID of the following statements in the session. A simple fix might be issuing a COMMIT in the "26999" process flow to close that transaction and let the other process advance. However, you need to understand if such a commit will cause other issues.
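As a rough sketch of what that fix looks like (the actual statements in the "26999" flow aren't known here, so the body is a placeholder):
begin;
-- ... the statements that session 26999 runs against the table ...
commit;  -- closing the transaction releases its locks so the blocked session can proceed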
If you can find all this info and need any help interpreting it, reach out.
Clearly a bug.
The table was cloned from one Redshift cluster to another by running SHOW TABLE table_name, which provided:
CREATE TABLE table_name (
    message character varying(50) ENCODE lzo,
    version integer ENCODE az64,
    id character varying(100) ENCODE lzo,
    access character varying(25) ENCODE lzo,
    type character varying(25) ENCODE lzo,
    product character varying(50) ENCODE lzo
)
DISTSTYLE AUTO SORTKEY AUTO;
After removing the "noise" the command completed as usual without errors:
DROP TABLE table_name;
CREATE TABLE table_name (
    message character varying(50),
    version integer,
    id character varying(100),
    access character varying(25),
    type character varying(25),
    product character varying(50)
);

Kinesis Analytics Session or Stagger Window Batching Without Aggregation

I'm looking to use Kinesis Data Analytics (or some other AWS managed service) to batch records based on filter criteria. The idea would be that as records come in, we'd start a session window and batch any matching records for 15 min.
The stagger window is exactly what we'd like except we're not looking to aggregate the data, but rather just return the records all together.
Ideally...
100 records spread over 15 min. (20 matching criteria) with first one at 10:02
|
v
At 10:17, the 20 matching records would be sent to the destination
I've tried doing something like:
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
"device_id" INTEGER,
"child_id" INTEGER,
"domain" VARCHAR(32),
"category_id" INTEGER,
"posted_at" DOUBLE,
"block" TIMESTAMP
);
-- Create pump to insert into output
CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
-- Select all columns from source stream
SELECT STREAM
"device_id",
"child_id",
"domain",
"category_id",
"posted_at",
FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE) as block
FROM "SOURCE_SQL_STREAM_001"
WHERE "category_id" = 888815186
WINDOWED BY STAGGER (
PARTITION BY "child_id", FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE)
RANGE INTERVAL '15' MINUTE);
I continue to get errors for all the columns not in the aggregation:
From line 6, column 5 to line 6, column 12: Expression 'domain' is not being used in PARTITION BY sub clause of WINDOWED BY clause
Kinesis Firehose was a suggested solution, but it's a blind window across all child_id values, so it could possibly cut up a session into multiple pieces, and that's what I'm trying to avoid.
Any suggestions? Feels like this might not be the right tool.
try LAST_VALUE("domain") as domain in the select clause.
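A minimal sketch of how that suggestion could look when applied to the query above, wrapping every selected column that is not part of the PARTITION BY (the extra LAST_VALUE wrappings beyond "domain" are an assumption); note that this still emits one row per child_id per window rather than every matching record:
CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM
    LAST_VALUE("device_id") AS "device_id",   -- not in PARTITION BY, so wrapped
    "child_id",                               -- partition key, selected directly
    LAST_VALUE("domain") AS "domain",
    LAST_VALUE("category_id") AS "category_id",
    LAST_VALUE("posted_at") AS "posted_at",
    FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE) AS "block"
FROM "SOURCE_SQL_STREAM_001"
WHERE "category_id" = 888815186
WINDOWED BY STAGGER (
    PARTITION BY "child_id", FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE)
    RANGE INTERVAL '15' MINUTE);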

Stream Analytics Aggregation Window

I need help/advice on how to ignore old events when performing aggregation over an extended window. I have sales data streaming into Event Hub.
Event Hub is used as the input stream. I need to produce two metrics:
- 30 sec aggregation (tumbling)
- Whole-day aggregated sales value, i.e. from gate open
Gate open time is variable (dynamic), hence I read a reference dataset off blob storage and join the GateOpen datetime to the sales stream.
The 30 sec aggregation over the tumbling window works fine.
Given that the gate open time is variable, I am currently using a 12-hour hopping window with a 30-second hop and trying to limit the events being aggregated by using EventProcessDatetime > GateOpen logic.
SELECT
Dateadd(ss,-30,System.Timestamp ) AS TimeSliceUTCStart
, System.Timestamp AS TimeSliceUTCEnd
, p.Section AS Section
, SUM(CASE WHEN p.Classification = 'Retail'
AND p.ActivityDateTime > p.GateOpen THEN p.[sales_amt_gross] ELSE 0 END) AS SaleTotalRetail
FROM FilteredBase p
GROUP BY
p.Section
, HoppingWindow(Duration(Hour, 12), hop(second, 30),Offset(millisecond, -1))
Problem: I am getting sales aggregated from the previous day/timeslice.
Overall, the outcome I am trying to achieve is simple. The store could be open for 5, 8, 10 or 12 hours max. We want to be able to see live sales for each section as the day progresses. Any advice or tips will be much appreciated.
Intuitively the query looks good, but what happens under the covers is that Azure Stream Analytics uses the reference data file that was valid at the time of each time window. Then, when it sees the events of the previous day, it will use the reference data present at that time (which may make the comparison p.ActivityDateTime > p.GateOpen true for the previous opening time).
I modified the query as follows (supposing you have 1 open event per day per section). Let me know if it works for you. If it doesn't, can you send some sample data so I can modify the query accordingly? We will investigate how to make these queries easier to write.
WITH thirdtysecReporting AS
(
SELECT
p.Section Section,
DATETIMEFROMPARTS(DATEPART(year, System.Timestamp), DATEPART(month, System.Timestamp), DATEPART(day, System.Timestamp), 0, 0, 0, 0) as date,
System.Timestamp Windowend,
SUM(p.sales_amt_gross) thirtysecSales
FROM input TIMESTAMP BY p.ActivityDateTime
GROUP BY TumblingWindow(second, 30), p.Section
)
,hopping AS
(
SELECT
Section,
System.Timestamp HopEnd,
date,
SUM(thirtysecSales) SumSales
FROM thirdtysecReporting
GROUP BY HoppingWindow(second, 86400, 30), Section, date -- Hopping on 24 hours, reported every 30 second
)
,filtered as -- This step ignores data from the previous day
(
SELECT
Section,
HopEnd,
date,
SUMQt = CASE
WHEN DAY(HopEnd) = DAY(date) OR DATEPART(hour, HopEnd) = DATEPART(hour, date) THEN SumSales
ELSE 0
END
FROM hopping
)
SELECT Section, -- Final query
HopEnd,
MAX(SUMQt) AS SumQt
FROM filtered
GROUP BY TumblingWindow(hour, 1), Section, hopend
Thanks,
JS - Azure Stream Analytics

Issue related to fully distributed deployment of siddhi query with join operation

I have set up a 2 node worker and 1 node manager siddhi cluster.
The following is the query I tried pushing into the manager. Everything seems to work fine when there is no join in the query, but with the join shown below, the query gets deployed on the worker node yet no output events seem to be produced.
#app:name("rule_1")
#source(
type="kafka",
topic.list="test-input-topic",
group.id="test-group",
threading.option="single.thread",
bootstrap.servers="localhost:9092",
#Map(type="json"))
define stream TempStream (deviceID string,roomNo string,temp int );
#sink(
type="kafka",
topic="test-output-topic",
bootstrap.servers="localhost:9092",
#Map(type="json"))
define stream OutStream (message string, message1 string, message2 double);
#info(name = "query1")
#dist(execGroup="group1")
from TempStream[deviceID=="rule_1" and temp>10]#window.time(5 sec)
select avg(temp) as avgTemp, roomNo, deviceID
insert all events into AvgTempStream1;
#info(name = "query2")
#dist(execGroup="group2")
from TempStream[deviceID=="rule_1" and temp<10]#window.time(5 sec)
select avg(temp) as avgTemp, roomNo, deviceID
insert all events into AvgTempStream2;
#info(name = "query3")
#dist(execGroup="group3")
from AvgTempStream1#window.length(1) as stream1 join AvgTempStream2#window.length(1) as stream2
select stream1.deviceID as message,stream1.roomNo as message1, stream1.avgTemp as message2
having stream1.avgTemp>stream2.avgTemp
insert into outputStream;
#info(name = "query4")
#dist(execGroup="group4")
from AvgTempStream1[deviceID=="rule_1"]
select deviceID as message, roomNo as message1, avgTemp as message2
insert into OutStream;
Event being passed
{"event":{"deviceID":"rule_1","roomNo":"123","temp":12}}
According to the above Siddhi app, you will need to send two input events for it to produce an output: one event with temp>10 and another with temp<10. Ex:
{"event":{"deviceID":"rule_1","roomNo":"123","temp":12}}
{"event":{"deviceID":"rule_1","roomNo":"123","temp":8}}
This will make sure that a join happens and an event is emitted. For troubleshooting purposes you can subscribe to the intermediate Kafka topics using a Kafka consumer. Names of intermediate topics follow the format SiddhiAppName.StreamName, e.g. rule_1.AvgTempStream1.
Hope this helps!!
Thanks,
Tishan

Update a table by using multiple joins

I am using Java DB (Java DB is Oracle's supported version of Apache Derby and contains the same binaries as Apache Derby. source: http://www.oracle.com/technetwork/java/javadb/overview/faqs-jsp-156714.html#1q2).
I am trying to update a column in one table; however, I need to join that table with 2 other tables within the same database to get accurate results (not my design, nor my choice).
Below are my three tables. ADSID is a key linking Vehicles and Customers, and ADDRESS and ZIP in Salesresp are used to link it to Customers. (Other fields left out for the sake of brevity.)
Salesresp(address, zip, prevsale)
Customers(adsid, address, zipcode)
Vehicles(adsid, selldate)
The goal is to find customers in the SalesResp table that have previously purchased a vehicle before the given date. They are identified by address and adsid in Customers and Vehicles respectively.
I have seen updates to a column with a single join and in fact asked a question about one of my own update/joins here (UPDATE with INNER JOIN). But now I need to take it that one step further and use both tables to get all the information.
I can get a multi-JOIN SELECT statement to work:
SELECT * FROM salesresp
INNER JOIN customers ON (SALESRESP.ZIP = customers.ZIPCODE) AND
(SALESRESP.ADDRESS = customers.ADDRESS)
INNER JOIN vehicles ON (Vehicles.ADSId =Customers.ADSId )
WHERE (VEHICLES.SELLDATE<'2013-09-24');
However I cannot get a multi-JOIN UPDATE statement to work.
I have attempted to try the update like this:
UPDATE salesresp SET PREVSALE = (SELECT SALESRESP.address FROM SALESRESP
WHERE SALESRESP.address IN (SELECT customers.address FROM customers
WHERE customers.adsid IN (SELECT vehicles.adsid FROM vehicles
WHERE vehicles.SELLDATE < '2013-09-24')));
And I am given this error: "Error code 30000, SQL state 21000: Scalar subquery is only allowed to return a single row".
But if I change that first "=" to a "IN" it gives me a syntax error for having encountered "IN" (Error code 30000, SQL state 42X01).
I also attempted to do more blatant inner joins, but upon attempting to execute this code I got the same error as above, "Error code 30000, SQL state 42X01", with it complaining about my use of the "FROM" keyword.
update salesresp set prevsale = vehicles.selldate
from salesresp sr
inner join vehicles v
on sr.prevsale = v.selldate
inner join customers c
on v.adsid = c.adsid
where v.selldate < '2013-09-24';
And in a different configuration:
update salesresp
inner join customer on salesresp.address = customer.address
inner join vehicles on customer.adsid = vehicles.ADSID
set salesresp.SELLDATE = vehicles.selldate where vehicles.selldate < '2013-09-24';
Where it finds the "INNER" distasteful: Error code 30000, SQL state 42X01: Syntax error: Encountered "inner" at line 3, column 1.
What do I need to do to get this multi-join update query to work? Or is it simply not possible with this database?
Any advice is appreciated.
If I were you I would:
1) Turn off autocommit (if you haven't already)
2) Craft a select/join which returns a set of columns that identifies the record you want to update. E.g. select c1, c2, ... from A join B join C... WHERE ...
3) Issue the update. E.g. update salesrep SET CX = cx where C1 = c1 AND C2 = c2 AND...
(Having an index on C1, C2, ... will boost performance)
4) Commit.
That way you don't have to worry about mixing the update and the join, and doing it within a txn ensures that nothing can change the result of the join before your update goes through.
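A minimal sketch of steps 2 and 3 using the tables from the question (the value bound for PREVSALE is an assumption, since its type isn't shown):
-- Step 2: identify the rows to update (join keys taken from the question)
SELECT sr.address, sr.zip, v.selldate
FROM salesresp sr
JOIN customers c ON c.address = sr.address AND c.zipcode = sr.zip
JOIN vehicles v ON v.adsid = c.adsid
WHERE v.selldate < '2013-09-24';

-- Step 3: update each row found above, e.g. via a JDBC PreparedStatement;
-- what to write into PREVSALE depends on its (unshown) type
UPDATE salesresp SET prevsale = ? WHERE address = ? AND zip = ?;

-- Step 4: commit once all updates have been issued
COMMIT;
Derby should also accept a single UPDATE ... WHERE EXISTS (subquery) form, but the select-then-update approach above makes it easy to verify exactly which rows will change before committing.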