I'm trying to write a simple query, but with two time windows; the query would be something like sending a message to users who have visited a product on the web more than twice in the last four months, provided that advertising for this product has not already been sent to them in the last month.
define stream webvisit (idClient string, idProduct string, chanel string)
from webvisit select idClient, idProduct, chanel, sum(1) as visits group by idClient insert into visits
from visits[idProduct == 'Fondos' and visits > 2]#window.time(4) insert into alert
Would something along these lines work?
You can do something like the following:
define stream webvisit (idClient string, idProduct string, chanel string)
from webvisit[idProduct == 'Fondos']#window.time(4 days)
select idClient, idProduct, chanel, count(idClient) as visitCount
group by idClient
insert into visits;
from visits[visitCount > 2]
select *
insert into resultStream;
In the second query we get the visit counts for each client during the last 4 days, and in the last query we filter those results to counts > 2.
EDIT:
Since you need to send a notification only if one hasn't been sent within the last day (assuming that is defined as: current time - 24 hours), you can try the following:
define stream webvisit (idClient string, idProduct string, chanel string);
from webvisit[idProduct == 'Fondos']#window.time(4 days)
select idClient, idProduct, chanel, count(idClient) as visitCount
group by idClient
insert into visits for current-events;
from visits[visitCount > 2]#window.time(1 day)
select idClient, idProduct, chanel, count(idClient) as hitsForClientPerDay
insert into tempStream;
from tempStream[hitsForClientPerDay < 2]
select idClient, idProduct, chanel, 'your custom message here' as advertisement
insert into advertisementStream;
The second query (with the 1 day window) keeps track of how many alerts ('hitsForClientPerDay') have been generated in the last 24 hours; the last query sends out the advertisement only if there hasn't been one during that period. (Note that hitsForClientPerDay will be 1 when the event arrives, since the current event is also considered by count(), so we check for < 2.)
I'm looking to use Kinesis Data Analytics (or some other AWS managed service) to batch records based on filter criteria. The idea would be that as records come in, we'd start a session window and batch any matching records for 15 min.
The stagger window is exactly what we'd like, except we're not looking to aggregate the data, but rather just return the records all together.
Ideally...
100 records spread over 15 min (20 matching the criteria), with the first one at 10:02
|
v
At 10:17, the 20 matching records would be sent to the destination
I've tried doing something like:
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
"device_id" INTEGER,
"child_id" INTEGER,
"domain" VARCHAR(32),
"category_id" INTEGER,
"posted_at" DOUBLE,
"block" TIMESTAMP
);
-- Create pump to insert into output
CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
-- Select all columns from source stream
SELECT STREAM
"device_id",
"child_id",
"domain",
"category_id",
"posted_at",
FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE) as block
FROM "SOURCE_SQL_STREAM_001"
WHERE "category_id" = 888815186
WINDOWED BY STAGGER (
PARTITION BY "child_id", FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE)
RANGE INTERVAL '15' MINUTE);
I continue to get errors for all the columns not in the aggregation:
From line 6, column 5 to line 6, column 12: Expression 'domain' is not being used in PARTITION BY sub clause of WINDOWED BY clause
Kinesis Firehose was a suggested solution, but it's a blind window across all child_ids, so it could possibly cut a session up into multiple batches, and that's what I'm trying to avoid.
Any suggestions? Feels like this might not be the right tool.
Try LAST_VALUE("domain") AS domain in the select clause.
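Applied to your query, that would look something like this (a sketch: every column the error complains about gets wrapped in LAST_VALUE, since STAGGER only lets the PARTITION BY expressions through unaggregated):
CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM
    LAST_VALUE("device_id") AS "device_id",
    "child_id",
    LAST_VALUE("domain") AS "domain",
    LAST_VALUE("category_id") AS "category_id",
    LAST_VALUE("posted_at") AS "posted_at",
    FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE) AS "block"
FROM "SOURCE_SQL_STREAM_001"
WHERE "category_id" = 888815186
WINDOWED BY STAGGER (
    PARTITION BY "child_id", FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE)
    RANGE INTERVAL '15' MINUTE);
Note that the stagger window still emits one aggregated row per partition per window, so this resolves the error but does not return every raw record in the batch.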
Recently we observed that when a user tries to complete a transaction on our website using an iOS device, Apple ends the current session and begins a new one. The difficulty with this is that if the user came through a paid source/email, the current session ends and a new session starts with apple.com as the traffic source.
For instance:
google->appleid.apple.com
(direct)->appleid.apple.com
email->appleid.apple.com
ios->appleid.apple.com->appleid.apple.com->appleid.apple.com
Since we have this raw data coming into BQ, we are looking at replacing appleid.apple.com with the actual traffic source, i.e. google, direct, email, ios.
Any help regarding the logic/function to work around this problem would be appreciated.
This is the code I tried implementing:
WITH DATA AS (
SELECT
PARSE_DATE("%Y%m%d",date) AS Date,
clientId as ClientId,
fullVisitorId AS fullvisitorid,
visitNumber AS visitnumber,
trafficSource.medium as medium,
CONCAT(fullvisitorid,"-",CAST(visitStartTime AS STRING)) AS Session_ID,
trafficsource.source AS Traffic_Source,
MAX((CASE WHEN (hits.eventInfo.eventLabel="complete") THEN 1 ELSE 0 END)) AS ConversionComplete
FROM `project.dataset.ga_sessions_20*`
,UNNEST(hits) AS hits
WHERE totals.visits=1
GROUP BY
1,2,3,4,5,6,7
),
Source_Replace AS (
SELECT
Date AS Date,
IF(Traffic_Source LIKE "%apple.com",
   (CASE WHEN Traffic_Source NOT LIKE "%apple.com%"
         THEN LAG(Traffic_Source, 1) OVER (PARTITION BY ClientId ORDER BY visitnumber ASC)
    END),
   Traffic_Source) AS traffic_source_1,
medium AS Medium,
fullvisitorid AS User_ID,
Session_ID AS SessionID,
ConversionComplete AS ConversionComplete
FROM
DATA
)
SELECT
Date AS Date,
traffic_source_1 AS TrafficSource,
Medium AS TrafficMedium,
COUNT(DISTINCT User_ID) AS Users,
COUNT(DISTINCT SessionID) AS Sessions,
SUM(ConversionComplete) AS ConversionComplete
FROM
Source_Replace
GROUP BY
1,2,3
Thanks
Does using visitStartTime as the key to identify the session start help? Maybe something like:
source_replaced as (
select *,
min(Traffic_Source) over (
partition by date, clientid, fullvisitorid, visitnumber order by visitStartTime
) as originating_source
from data
)
Then you can do your aggregation over originating_source. It's kind of difficult without looking at a sample of the data to see what's going on.
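For example, the final aggregation might look like this (a sketch reusing the column names from your DATA CTE, untested against your data):
select
  Date,
  originating_source as TrafficSource,
  medium as TrafficMedium,
  count(distinct fullvisitorid) as Users,
  count(distinct Session_ID) as Sessions,
  sum(ConversionComplete) as ConversionComplete
from source_replaced
group by 1, 2, 3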
Hope it helps.
I'm collecting events with different IDs; there are n types of fixed IDs in the incoming events. I want to compute the average of past events, based on a time frame or a number of events, and compare it between the different IDs.
Let's say there are 2 devices sending data/events with IDs 'a' and 'b'. I want to get the average of the past 5 minutes of data for both devices and then compare the two averages to make some decision.
With this code, I'm collecting the past n minutes of data and storing the averages in 2 windows.
#source(type='http', receiver.url='http://localhost:5007/SweetProductionEP', #map(type = 'json'))
define stream InProduction(name string, amount int);
define window hold_a(avg_amount double) length(1);
define window hold_b(avg_amount double) length(1);
from InProduction[name=='a']#window.timeBatch(5 min)
select avg(amount) as avg_amount
group by name
insert into hold_a;
from InProduction[name=='b']#window.timeBatch(5 min)
select avg(amount) as avg_amount
group by name
insert into hold_b;
Windows hold_a and hold_b will hold the average over the past 5 minutes. Now I want to compare the data from both windows and make a decision.
I've tried a join on both windows, but the join query doesn't get executed.
You have to use a pattern to achieve this. The query below will output the name that had the highest average into HighestAvgStream.
#source(type='http', receiver.url='http://localhost:5007/SweetProductionEP', #map(type = 'json'))
define stream InProduction(name string, amount int);
from InProduction[name=='a']#window.timeBatch(5 min)
select avg(amount) as avg_amount, name
insert into avgStream;
from InProduction[name=='b']#window.timeBatch(5 min)
select avg(amount) as avg_amount, name
insert into avgStream;
from every(e1=avgStream -> e2=avgStream)
select ifthenelse(e1.avg_amount>e2.avg_amount,e1.name,e2.name) as highestAvgName
insert into HighestAvgStream;
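One caveat: the pattern above matches any two consecutive avgStream events, so both could come from the same device. If you need the pair to come from different devices, a variant like this should work (a sketch, assuming the same stream definitions):
from every(e1=avgStream) -> e2=avgStream[e1.name != name]
select ifthenelse(e1.avg_amount > e2.avg_amount, e1.name, e2.name) as highestAvgName
insert into HighestAvgStream;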
I need help/advice on how to ignore old events when performing aggregation over an extended window. I have sales data streaming into Event Hub.
Event Hub is used as the input stream. I need to produce two metrics:
- 30 sec aggregation (tumbling)
- Whole-day aggregated sales value, i.e. from gate open
Gate open time is variable (dynamic), hence I read a reference dataset off the blob and join the GateOpen datetime to the sales stream.
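For reference, that join (which produces the FilteredBase used below) presumably looks something like this; the stream name, reference input name, and join key here are assumptions:
WITH FilteredBase AS (
    SELECT
        s.Section,
        s.Classification,
        s.sales_amt_gross,
        s.ActivityDateTime,
        r.GateOpen
    FROM SalesStream s TIMESTAMP BY s.ActivityDateTime
    JOIN GateOpenRef r -- reference input loaded from blob storage
        ON s.Section = r.Section
)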
The 30 sec aggregation over the tumbling window works fine.
Given that the gate open is variable, I am currently using a 12 hour hopping window with a 30 sec hop and trying to limit the events to be aggregated by using EventProcessDatetime > GateOpen logic.
SELECT
Dateadd(ss,-30,System.Timestamp ) AS TimeSliceUTCStart
, System.Timestamp AS TimeSliceUTCEnd
, p.Section AS Section
, SUM(CASE WHEN p.Classification = 'Retail'
AND p.ActivityDateTime > p.GateOpen THEN p.[sales_amt_gross] ELSE 0 END) AS SaleTotalRetail
FROM FilteredBase p
GROUP BY
p.Section
, HoppingWindow(Duration(Hour, 12), hop(second, 30),Offset(millisecond, -1))
Problem: I am getting sales aggregated from the previous day/timeslice.
Overall, the outcome I am trying to achieve is simple. The store could be open for 5, 8, 10, or 12 hours max. We want to be able to see sales as a live stream, for each section, as the day progresses. Any advice or tips will be much appreciated.
Intuitively the query looks good, but what happens under the cover is that Azure Stream Analytics is using the reference data file that was valid at the time of each time window. Then, when it sees an event from the previous day, it will use the reference data present at that time (which may make the comparison p.ActivityDateTime > p.GateOpen true for the previous opening time).
I modified the query as follows (supposing you have 1 open event per day per section). Let me know if it works for you. If it doesn't, can you send some sample data so I can modify the query accordingly? We will investigate how to make these queries easier to write.
WITH thirtysecReporting AS
(
SELECT
p.Section Section,
DATETIMEFROMPARTS(DATEPART(year, System.Timestamp), DATEPART(month, System.Timestamp), DATEPART(day, System.Timestamp), 0, 0, 0, 0) as date,
System.Timestamp Windowend,
SUM(p.sales_amt_gross) thirtysecSales
FROM input p TIMESTAMP BY p.ActivityDateTime
GROUP BY TumblingWindow(second, 30), p.Section
)
,hopping AS
(
SELECT
Section,
System.Timestamp HopEnd,
date,
SUM(thirtysecSales) SumSales
FROM thirtysecReporting
GROUP BY HoppingWindow(second, 86400, 30), Section, date -- hopping over 24 hours, reported every 30 seconds
)
,filtered as -- This step ignores data from the previous day
(
SELECT
Section,
HopEnd,
date,
CASE
    WHEN DAY(HopEnd) = DAY(date) OR DATEPART(hour, HopEnd) = DATEPART(hour, date) THEN SumSales
    ELSE 0
END AS SUMQt
FROM hopping
)
SELECT Section, -- Final query
HopEnd,
MAX(SUMQt) AS SumQt
FROM filtered
GROUP BY TumblingWindow(hour, 1), Section, hopend
Thanks,
JS - Azure Stream Analytics
I am trying to analyze what the most popular hashtags of July are. So far I am able to select tweets from July, or display the most popular hashtags, but I didn't succeed in putting them together. I am thinking about creating an intermediate table with July tweets, then displaying the popular hashtags, but I don't know how; can you help me? What about a 2-level select (select a from (select b from table))?
SELECT hashtags.text, COUNT(*) AS total FROM tweets
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
WHERE regexp_extract(created_at, "(Tue) (Jul)*", 2) = "Jul"
GROUP BY LOWER(hashtags.text), created_at
ORDER BY total DESC
LIMIT 200
Regards, K.
So far, I did this, which is pretty much what I want, but is there any way to achieve this differently?
Working nested query:
SELECT
LOWER(hashtags.text),
COUNT(*) AS total_count
FROM (
SELECT * FROM tweets WHERE regexp_extract(created_at,"(Tue Jul)*",1) = "Tue Jul"
) tweets
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
GROUP BY LOWER(hashtags.text)
ORDER BY total_count DESC
LIMIT 15
EDIT:
OK, so if you want, you can also do it with a temporary table:
CREATE TABLE tmpdb (
id BIGINT,
created_at STRING,
source STRING,
favorited BOOLEAN,
retweet_count INT,
retweeted_status STRUCT<
text:STRING,
user:STRUCT<screen_name:STRING,name:STRING>>,
entities STRUCT<
urls:ARRAY<STRUCT<expanded_url:STRING>>,
user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
hashtags:ARRAY<STRUCT<text:STRING>>>,
text STRING,
user STRUCT<
screen_name:STRING,
name:STRING,
friends_count:INT,
followers_count:INT,
statuses_count:INT,
verified:BOOLEAN,
utc_offset:INT,
time_zone:STRING>,
in_reply_to_screen_name STRING
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe';
Then you update it:
INSERT OVERWRITE TABLE tmpdb
SELECT * FROM tweets WHERE regexp_extract(created_at,"(Tue Jul)*",1) = "Tue Jul"
And the query becomes as simple as this:
SELECT
LOWER(hashtags.text),
COUNT(*) AS total_count
FROM tmpdb
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
GROUP BY LOWER(hashtags.text)
ORDER BY total_count DESC
LIMIT 15
The pros/cons of the second method:
You need to update the table if you want accurate results, so it is not suited for one-shot queries; but if you need to run multiple queries against the current state of the database, then this method is better.
Don't forget that copying a database is a costly operation! So know when to use it :)