Kinesis Analytics Session or Stagger Window Batching Without Aggregation - amazon-web-services

I'm looking to use Kinesis Data Analytics (or some other AWS managed service) to batch records based on a filter criteria. The idea would be that as records come in, we'd start a session window and batch any matching records for 15 min.
The stagger window is exactly what we'd like except we're not looking to aggregate the data, but rather just return the records all together.
Ideally...
100 records spread over 15 min (20 matching the criteria), with the first one at 10:02
|
v
At 10:17, the 20 matching records would be sent to the destination
I've tried doing something like:
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
"device_id" INTEGER,
"child_id" INTEGER,
"domain" VARCHAR(32),
"category_id" INTEGER,
"posted_at" DOUBLE,
"block" TIMESTAMP
);
-- Create pump to insert into output
CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
-- Select all columns from source stream
SELECT STREAM
"device_id",
"child_id",
"domain",
"category_id",
"posted_at",
FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE) as block
FROM "SOURCE_SQL_STREAM_001"
WHERE "category_id" = 888815186
WINDOWED BY STAGGER (
PARTITION BY "child_id", FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE)
RANGE INTERVAL '15' MINUTE);
I continue to get errors for all the columns not in the aggregation:
From line 6, column 5 to line 6, column 12: Expression 'domain' is not being used in PARTITION BY sub clause of WINDOWED BY clause
Kinesis Firehose was a suggested solution, but its buffering window is blind to child_id, so it could cut a session into multiple batches, and that's what I'm trying to avoid.
Any suggestions? Feels like this might not be the right tool.

Try LAST_VALUE("domain") AS domain in the SELECT clause.
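Extending that suggestion to the query above, every column that is not in the PARTITION BY list gets wrapped the same way. A rough, untested sketch (note that LAST_VALUE keeps one value per partition, so this emits a single summary row per child_id per window rather than all 20 matching records):
CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM
LAST_VALUE("device_id") AS "device_id",
"child_id",
LAST_VALUE("domain") AS "domain",
LAST_VALUE("category_id") AS "category_id",
LAST_VALUE("posted_at") AS "posted_at",
FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE) AS "block"
FROM "SOURCE_SQL_STREAM_001"
WHERE "category_id" = 888815186
WINDOWED BY STAGGER (
PARTITION BY "child_id", FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE)
RANGE INTERVAL '15' MINUTE);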

Related

Stuck with kinesis stagger window

I have a Kinesis Analytics application set up which takes data from a Kinesis stream with the following schema.
--------------------------
Column ColumnType
--------------------------
Level varchar(10)
RootID varchar(32)
ProcessID varchar(16)
EntityName varchar(64)
Message varchar(512)
Threshold varchar(32)
TriggerTime timestamp
My objective is to create a real-time Kinesis Analytics solution which segregates the records with level "OVERFLOW" and
groups them based on the RootID. All records belonging to a RootID are ideally expected to reach Kinesis within a span
of 5 minutes, so I am thinking of setting up a stagger window for this. So far I have come up with this SQL.
CREATE OR REPLACE STREAM "OVERFLOW_SQL_STREAM" (
"Level" varchar (10),
"RootID" varchar (32),
"ProcessID" varchar(16),
"EntityName" varchar(64),
"Message" varchar(512),
"Threshold" varchar(32),
"TriggerTime" timestamp
);
CREATE OR REPLACE PUMP "STREAM_PUMP" AS
INSERT INTO "OVERFLOW_SQL_STREAM"
SELECT STREAM
"Level" varchar (10),
"RootID" varchar (32),
"ProcessID" varchar(16),
"EntityName" varchar(64),
"Message" varchar(512),
"Threshold" varchar(32),
"TriggerTime" timestamp
FROM "SOURCE_SQL_STREAM_001"
WHERE "Level" like "OVERFLOW"
WINDOWED BY STAGGER (
PARTITION BY "RootID",FLOOR("TriggerTime" TO MINUTE) RANGE INTERVAL '5' MINUTE);
I received an error in the SQL stating that "PARTITION BY clause doesn't have the column 'Level'". I don't understand why
I should add that column to the partition, as I want my records partitioned only by the RootID column and not by any other.
Adding that column throws an error saying that I should add the next column, and so on. I couldn't understand the error.
Kindly help me! Thanks!
There is a workaround for this type of problem.
You can use FIRST_VALUE() or LAST_VALUE() on the remaining columns instead of passing them through directly.
CREATE OR REPLACE PUMP "STREAM_PUMP" AS
INSERT INTO "OVERFLOW_SQL_STREAM"
SELECT STREAM
LAST_VALUE("Level") AS Level,
"RootID" varchar (32),
....
....
....
"TriggerTime" timestamp
FROM "SOURCE_SQL_STREAM_001"
WHERE "Level" like "OVERFLOW"
WINDOWED BY STAGGER (
PARTITION BY "RootID",FLOOR("TriggerTime" TO MINUTE) RANGE INTERVAL '5' MINUTE);
This way you can create the stream pump without adding every column to the PARTITION BY clause.
FIRST_VALUE() -- returns the very first value of the column within the stream partition (here, RootID)
LAST_VALUE() -- vice versa, returns the last value within the partition
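For completeness, the full pump for the schema above might look like the following (an untested sketch; every column outside the PARTITION BY list is wrapped in LAST_VALUE):
CREATE OR REPLACE PUMP "STREAM_PUMP" AS
INSERT INTO "OVERFLOW_SQL_STREAM"
SELECT STREAM
LAST_VALUE("Level") AS "Level",
"RootID",
LAST_VALUE("ProcessID") AS "ProcessID",
LAST_VALUE("EntityName") AS "EntityName",
LAST_VALUE("Message") AS "Message",
LAST_VALUE("Threshold") AS "Threshold",
LAST_VALUE("TriggerTime") AS "TriggerTime"
FROM "SOURCE_SQL_STREAM_001"
WHERE "Level" LIKE 'OVERFLOW'
WINDOWED BY STAGGER (
PARTITION BY "RootID", FLOOR("TriggerTime" TO MINUTE) RANGE INTERVAL '5' MINUTE);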

How to do calculations/ comparisons on 2 windows?

I'm collecting events with different IDs; there are n fixed types of IDs in the incoming events. I want to compute the average of past events, based on a time frame or a number of events, for the different ID types.
Let's say there are 2 devices sending data/events with IDs 'a' and 'b'. I want to get the average of the past 5 minutes of data for both devices and then compare the two averages to make a decision.
With this code, I'm collecting the past n minutes of data and storing it in 2 windows.
#source(type='http', receiver.url='http://localhost:5007/SweetProductionEP', #map(type = 'json'))
define stream InProduction(name string, amount int);
define window hold_a(avg_amount double) length(1);
define window hold_b(avg_amount double) length(1);
from InProduction[name=='a']#window.timeBatch(5 min)
select avg(amount) as avg_amount
group by name
insert into hold_a;
from InProduction[name=='b']#window.timeBatch(5 min)
select avg(amount) as avg_amount
group by name
insert into hold_b;
Windows hold_a and hold_b will hold the average of the past 5 minutes of data. Now I want to compare the data from both windows and make a decision.
I've tried a join on both windows, but the join query doesn't get executed.
You have to use a pattern to achieve this. The query below will output the name with the highest average into HighestAvgStream.
#source(type='http', receiver.url='http://localhost:5007/SweetProductionEP', #map(type = 'json'))
define stream InProduction(name string, amount int);
from InProduction[name=='a']#window.timeBatch(5 min)
select avg(amount) as avg_amount, name
insert into avgStream;
from InProduction[name=='b']#window.timeBatch(5 min)
select avg(amount) as avg_amount, name
insert into avgStream;
from every(e1=avgStream -> e2=avgStream)
select ifthenelse(e1.avg_amount>e2.avg_amount,e1.name,e2.name) as highestAvgName
insert into HighestAvgStream;

Stream Analytics Aggregation Window

I need help/advice on how to ignore old events when performing aggregation over an extended window. I have sales data streaming into Event Hub.
Event Hub is used as the input stream. I need to produce two metrics:
- 30 sec aggregation (tumbling)
- Whole-day aggregated sales value, i.e. from gate open
Gate open time is variable (dynamic), hence I read a reference dataset off the blob and join the GateOpen datetime to the sales stream.
The 30 sec aggregation over the tumbling window works fine.
Given that gate open is variable, I am currently using a 12-hour hopping window with a 30 sec hop and trying to limit the events being aggregated using EventProcessDatetime > GateOpen logic.
SELECT
Dateadd(ss,-30,System.Timestamp ) AS TimeSliceUTCStart
, System.Timestamp AS TimeSliceUTCEnd
, p.Section AS Section
, SUM(CASE WHEN p.Classification = 'Retail'
AND p.ActivityDateTime > p.GateOpen THEN p.[sales_amt_gross] ELSE 0 END) AS SaleTotalRetail
FROM FilteredBase p
GROUP BY
p.Section
, HoppingWindow(Duration(Hour, 12), hop(second, 30),Offset(millisecond, -1))
Problem: I am getting sales aggregated from the previous day/timeslice.
Overall, the outcome I am trying to achieve is simple. The store could be open for 5, 8, 10 or 12 hours max. We want to be able to know sales, as a live stream, for each section as the day progresses. Any advice or tip will be much appreciated.
Intuitively the query looks good, but what happens under the covers is that Azure Stream Analytics uses the reference data file that was valid at the time of each time window. Then, when it sees an event from the previous day, it will use the reference data present at that time (which may make the comparison p.ActivityDateTime > p.GateOpen true for the previous opening time).
I modified the query as follows (supposing you have 1 open event per day per section). Let me know if it works for you. If it doesn't, can you send some sample data so I can modify the query accordingly? We will investigate how to make these queries easier to write.
WITH thirdtysecReporting AS
(
SELECT
p.Section Section,
DATETIMEFROMPARTS(DATEPART(year, System.Timestamp), DATEPART(month, System.Timestamp), DATEPART(day, System.Timestamp), 0, 0, 0, 0) as date,
System.Timestamp Windowend,
SUM(p.sales_amt_gross) thirtysecSales
FROM input TIMESTAMP BY p.ActivityDateTime
GROUP BY TumblingWindow(second, 30), p.Section
)
,hopping AS
(
SELECT
Section,
System.Timestamp HopEnd,
date,
SUM(thirtysecSales) SumSales
FROM thirdtysecReporting
GROUP BY HoppingWindow(second, 86400, 30), Section, date -- Hopping over 24 hours, reported every 30 seconds
)
,filtered as -- This step ignores data from the previous day
(
SELECT
Section,
HopEnd,
date,
SUMQt = CASE
WHEN DAY(HopEnd) = DAY(date) OR DATEPART(hour, HopEnd) = DATEPART(hour, date) THEN SumSales
ELSE 0
END
FROM hopping
)
SELECT Section, -- Final query
HopEnd,
MAX(SUMQt) AS SumQt
FROM filtered
GROUP BY TumblingWindow(hour, 1), Section, hopend
Thanks,
JS - Azure Stream Analytics

DynamoDB QuerySpec {MaxResultSize + filter expression}

From the DynamoDB documentation
The Query operation allows you to limit the number of items that it
returns in the result. To do this, set the Limit parameter to the
maximum number of items that you want.
For example, suppose you Query a table, with a Limit value of 6, and
without a filter expression. The Query result will contain the first
six items from the table that match the key condition expression from
the request.
Now suppose you add a filter expression to the Query. In this case,
DynamoDB will apply the filter expression to the six items that were
returned, discarding those that do not match. The final Query result
will contain 6 items or fewer, depending on the number of items that
were filtered.
Looks like the following query should return (at least sometimes) 0 records.
In summary, I have a UserLogins table. A simplified version is:
1. UserId - HashKey
2. DeviceId - RangeKey
3. ActiveLogin - Boolean
4. TimeToLive - ...
Now, let's say UserId = X has 10,000 inactive logins in different DeviceIds and 1 active login.
However, when I run this query against my DynamoDB table:
QuerySpec{
hashKey: null,
rangeKeyCondition: null,
queryFilters: null,
nameMap: {"#0" -> "UserId"}, {"#1" -> "ActiveLogin"}
valueMap: {":0" -> "X"}, {":1" -> "true"}
exclusiveStartKey: null,
maxPageSize: null,
maxResultSize: 10,
req: {TableName: UserLogins,ConsistentRead: true,ReturnConsumedCapacity: TOTAL,FilterExpression: #1 = :1,KeyConditionExpression: #0 = :0,ExpressionAttributeNames: {#0=UserId, #1=ActiveLogin},ExpressionAttributeValues: {:0={S: X,}, :1={BOOL: true}}}
I always get 1 row. The 1 active login for UserId=X. And it's not happening just for 1 user, it's happening for multiple users in a similar situation.
Are my results contradicting the DynamoDB documentation?
It looks like a contradiction: maxResultSize=10 means that DynamoDB should only read the first 10 items (out of 10,001) and then apply the filter active=true to those (which might return 0 results). It seems very unlikely that the record with active=true happened to be among the first 10 records that DynamoDB read.
This is happening for hundreds of customers that are running similar queries. It works great, when according to the documentation it shouldn't be working.
I can't see any obvious problem with the Query. Are you sure about your premise that users have 10,000 items each?
Your keys are UserId and DeviceId. That seems to mean that if your user logs in with the same device, it would overwrite the existing item. Or, put another way, I think you are saying your users have 10,000 different devices each (unless the DeviceId rotates in some way).
In your shoes I would just remove the filter expression and print the results to the log to see what you're getting in your 10 results. Then remove the limit too and see what results you get with that.

How to update a stream with the response from another stream where the sink type is "http-response"

I am trying to enrich my input stream with an additional attribute which gets populated via the "http-response" sink.
I have tried using a "join" with a window attribute, and the "every" keyword, to merge two streams and insert the resulting merged stream into another stream to enrich it.
The window attributes (window.time(1 sec) or window.length(1)) and the "every" keyword work well when the incoming events arrive at a regular interval of 1 sec or more.
When, say, 10 or 100 events are sent at the same time (within a second), the result of the merge is not as expected.
The one with "window" attribute (join)
from EventInputStreamOne#window.time(1 sec) as i
join EventInputStreamTwo as s
on i.variable2 == s.variable2
select i.variable1 as variable1, i.variable2 as variable2, s.variable3 as variable3
insert into EventOutputStream;
The one with the "every" keyword
from every e1=EventInputStream,e2=EventResponseStream
select e1.variable1 as variable1, e1.variable2 as variable2, e2.variable3 as variable3
insert into EventOutputStream;
Is there any better way to merge the two streams in order to update a third stream?
To get the original request attributes, you can use custom mapping as follows,
#source(type='http-call-response', sink.id='source-1',
#map(type='json', #attributes(name='name', id='id', volume='trp:volume', price='trp:price')))
define stream responseStream(name string, id int, headers string, volume long, price float);
Here, the request attributes can be accessed with trp:attributeName; in this sample, only name is from the response, while price and volume are from the request.
The syntax in your 'every' keyword approach isn't quite right. Have you tried something like this:
from every (e1 = event1) -> e2=event2[e1.variable == e2.variable]
select e1.variable1, e2.variable1, e2.variable2
insert into outputEvent;
This document might help.