I need help / advice on how to ignore old events when performing aggregation over an extended window. I have sales data streaming into Event Hub.
Event Hub is used as the input stream. I need to produce two metrics:
- 30-second aggregation (tumbling)
- Whole-day aggregated sales value, i.e. from gate open
The gate open time is variable (dynamic), hence I read a reference dataset from blob storage and join the GateOpen datetime to the sales stream.
The 30-second aggregation over the tumbling window works fine.
Given that the gate open time is variable, I am currently using a 12-hour hopping window with a 30-second hop, and trying to limit the events being aggregated with EventProcessDatetime > GateOpen logic.
SELECT
    DATEADD(ss, -30, System.Timestamp) AS TimeSliceUTCStart
    , System.Timestamp AS TimeSliceUTCEnd
    , p.Section AS Section
    , SUM(CASE WHEN p.Classification = 'Retail'
               AND p.ActivityDateTime > p.GateOpen THEN p.[sales_amt_gross] ELSE 0 END) AS SaleTotalRetail
FROM FilteredBase p
GROUP BY
    p.Section
    , HoppingWindow(Duration(hour, 12), Hop(second, 30), Offset(millisecond, -1))
Problem: I am getting sales aggregated from the previous day/timeslice.
Overall, the outcome I am trying to achieve is simple. The store could be open for 5, 8, 10, or 12 hours max. We want to be able to see sales live, for each section, as the day progresses. Any advice or tips will be much appreciated.
Intuitively the query looks good, but what happens under the covers is that Azure Stream Analytics uses the reference data file that was valid at the time of each time window. Then, when it sees the events of the previous day, it uses the reference data present at that time (which may make the comparison p.ActivityDateTime > p.GateOpen true for the previous opening time).
I modified the query as follows (supposing you have one gate-open event per day per section). Let me know if it works for you. If it doesn't, could you send some sample data so I can modify the query accordingly? We will investigate how to make these queries easier to write.
WITH thirtysecReporting AS
(
    SELECT
        p.Section Section,
        DATETIMEFROMPARTS(DATEPART(year, System.Timestamp), DATEPART(month, System.Timestamp), DATEPART(day, System.Timestamp), 0, 0, 0, 0) AS date,
        System.Timestamp Windowend,
        SUM(p.sales_amt_gross) thirtysecSales
    FROM input p TIMESTAMP BY p.ActivityDateTime
    GROUP BY TumblingWindow(second, 30), p.Section
)
,hopping AS
(
    SELECT
        Section,
        System.Timestamp HopEnd,
        date,
        SUM(thirtysecSales) SumSales
    FROM thirtysecReporting
    GROUP BY HoppingWindow(second, 86400, 30), Section, date -- Hopping over 24 hours, reported every 30 seconds
)
,filtered AS -- This step ignores data from the previous day
(
    SELECT
        Section,
        HopEnd,
        date,
        SUMQt = CASE
                    WHEN DAY(HopEnd) = DAY(date) OR DATEPART(hour, HopEnd) = DATEPART(hour, date) THEN SumSales
                    ELSE 0
                END
    FROM hopping
)
SELECT Section, -- Final query
    HopEnd,
    MAX(SUMQt) AS SumQt
FROM filtered
GROUP BY TumblingWindow(hour, 1), Section, HopEnd
Thanks,
JS - Azure Stream Analytics
Related
I just started working with BigQuery. Data comes from Firebase, and I noticed that I get the data each day, for example gara-e78a5.analytics_247657392.events_20221231, gara-e78a5.analytics_247657392.events_20221230, etc.
Each row comes with an event_date in this format: 20221231.
I want to count the number of people landing on our page each week, but I don't know how to group them by week.
I started with something like this, but I don't know how to group it by week:
SELECT count(event_name) FROM app-xxxx.analytics_247657392.events_* where event_name = 'page_download_view' group by
Thanks in advance for your help
Based on @Ronak's answer, I found the solution.
SELECT week_of_year, SUM(nb_download) AS nb_download_per_week
FROM (
    SELECT DISTINCT EXTRACT(WEEK FROM PARSE_DATE('%Y%m%d', event_date)) AS week_of_year,
        COUNT(event_name) AS nb_download
    FROM `tabllle-e78a5.analytics_XXXXX.events_*`
    WHERE event_name = 'landing_event_download_apk'
    GROUP BY event_date)
GROUP BY week_of_year
You can use the WEEK (or ISOWEEK) function.
WEEK: Returns the week number of the date in the range [0, 53]
More: https://cloud.google.com/bigquery/docs/reference/standard-sql/date_functions
Formats - https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions#format_timestamp
This should work:
SELECT EXTRACT(ISOWEEK FROM CAST(PARSE_DATE('%Y%m%d', <column>) AS TIMESTAMP)) AS week_of_year
FROM <table>
Output
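Putting the extraction and the grouping together, a minimal sketch of the weekly count (reusing the events_* table and event_name filter from the question; adjust the names to your dataset) could look like:
SELECT
    EXTRACT(ISOWEEK FROM PARSE_DATE('%Y%m%d', event_date)) AS week_of_year,
    COUNT(event_name) AS nb_page_views
FROM `app-xxxx.analytics_247657392.events_*`
WHERE event_name = 'page_download_view'
GROUP BY week_of_year
ORDER BY week_of_year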
I am trying to create a plot which normalizes the x axis based on the date of an event listed in a different query. I have the production data sitting in one query, which contains the value per day. I have a list of events in a second query, which have specific dates at which they occurred. My goal is to create some measure or function that will plot each event on the axis in a normalized fashion, where the production from -1y to +1y around the event is shown on the x axis. This allows a comparison of the last year of production before the event versus the next year, which can gauge the success of the event. I don't know how to do this successfully, though, and would appreciate any insight.
Current plot with event dates in table on the right
A person who did something similar merged the two queries and created a column which finds the date of the event and determines how much time has passed since it (e.g., 30 days after the event is given 30). However, because I have about 70 data points which each have 4-5 of their own events, this merged query duplicated the production for each event, so there are 4-5 copies of the production data per data point, which has become hard to manage as I try to do something similar and understand what they're doing. I believe there's a better way to do this, but I can't figure out how to connect the two queries more efficiently.
Normalized production from day 0 of each event that I'm trying to create
My code below is my attempt at making a measure that will create this normalized axis for me, but I am getting a circular dependency error.
Normalized Producing Days From Workover =
// These reference the date values in the measures made from the workover start and end dates
VAR startofworkover = DATEVALUE([Measure: WKO Start Date])
VAR endofworkover = DATEVALUE([Measure: WKO End Date])
// Filter to production dates after the workover
VAR afterworkover =
    FILTER('Production Data',
        'Production Data'[Date] >= endofworkover
        && 'Production Data'[Well Pair] = EARLIER('Production Data'[Well Pair])
        && [Allocated Oil Production (m3/d)] > 0)
// Filter to production dates before the workover
VAR beforeworkover =
    FILTER('Production Data',
        'Production Data'[Date] < startofworkover
        && 'Production Data'[Well Pair] = EARLIER('Production Data'[Well Pair])
        && [Allocated Oil Production (m3/d)] > 0)
VAR result1 =
    SWITCH(TRUE(),
        'Production Data'[Date] = startofworkover, 0,
        'Production Data'[Date] >= endofworkover, RANKX(afterworkover, 'Production Data'[Date], , ASC, Dense),
        'Production Data'[Date] < startofworkover, RANKX(beforeworkover, 'Production Data'[Date], , DESC, Dense) * -1)
RETURN
    result1
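For reference, the column the other person built (days elapsed since an event) could be sketched as a calculated column on the production table rather than a measure. This is only an illustration under assumptions: the 'Events' table name and its [Event Date] and [Well Pair] columns are hypothetical stand-ins for the events query, and it only returns days since the most recent prior event rather than a full -1y/+1y window per event.
Days Since Last Event =
// Hypothetical events table: 'Events'[Event Date], 'Events'[Well Pair]
VAR CurrentWellPair = 'Production Data'[Well Pair]
VAR CurrentDate = 'Production Data'[Date]
// Most recent event on or before the current production date for this well pair
VAR LastEventDate =
    CALCULATE(
        MAX('Events'[Event Date]),
        FILTER(ALL('Events'),
            'Events'[Well Pair] = CurrentWellPair
            && 'Events'[Event Date] <= CurrentDate))
RETURN
    IF(NOT ISBLANK(LastEventDate), DATEDIFF(LastEventDate, CurrentDate, DAY))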
I work in production, where we measure the time use of different machines. Basically, I want to show my colleagues with a bar chart in Power BI when most people start using the machine and when most people are done with the machine (shown in full hours, for example 7, 10, 16).
I have 2 columns (start time & end time) in time format (16:30:00) that I've changed to whole numbers (Start time as number, and an End time as number that can't be seen in the picture due to the broad sheet). See below.
Formula: Start time as number = HOUR(sheet1[Start time])
The problem I have encountered is that when I put start time and end time in the same table, it shows exactly the same values. But if I make one table with start and a different one with end, it shows the correct result. See the pictures below:
Above is the merge of Start and End time but it should look like below
The left table above is start time and right table is end time of machine.
Thanks!
EDIT: When I try the formula from Mik:
EDIT: Picture of my situation.
I think it's correct now! Will try on my main data.
Create a table for the X-axis scale with
Hours = GENERATESERIES(0,23,1)
This is a measure for start (for end it is the same, just change the column name):
start =
COUNTROWS(
    FILTER(
        'sheet1',
        HOUR(sheet1[Start time]) = SELECTEDVALUE(Hours[Value])
    )
)
end =
COUNTROWS(
    FILTER(
        'sheet1',
        HOUR(sheet1[End time]) = SELECTEDVALUE(Hours[Value])
    )
)
I need to calculate the time spent from one stop to another.
If the trip takes place in the same zipcode area and takes more than one hour, I should report it.
For example,
The attached image with the dataframe shows an instance where the arrivedTime from one stop to the next within the same zipcode differs by more than one hour, which I flag. If it is less than one hour, no action is needed. The data should be grouped and ordered by RuteID, Sequence, and arrivedTime. In the attached example, the trip from zipCode 2300 to 2300 takes less than one hour, so it is OK; but if it is greater than one hour, I should report it in a new true/false column.
I would do this in Power Query using these steps:
Add a new column "NextSequence" = [Sequence]+1
Do a nested join with the table itself, linking {"RouteID", "Sequence"} with {"RouteID", "NextSequence"}, and when expanding, only keep the DepartedTime and ZipCode columns, calling them something like "PrevDepTime" and "PrevZip"
Write an IF-statement saying that if [PrevZip] <> null and [PrevZip] = [ZipCode] and [ArrivedTime]-[PrevDepTime] >= #time(1,0,0) then "Late" else "OK"
Clean up the table, keeping only the columns you want.
Then when loading the new table into Power BI, I have something that looks like this (I added a few stops on RouteID 2 for testing):
My M-code:
AddNextSequence = Table.AddColumn(PREVIOUS_STEP_NAME, "NextSequence", each [Sequence]+1, Int64.Type),
NestedJoin = Table.NestedJoin(AddNextSequence, {"RouteID", "Sequence"}, AddNextSequence, {"RouteID", "NextSequence"}, "AddedTable", JoinKind.LeftOuter),
ExpandTable = Table.ExpandTableColumn(NestedJoin, "AddedTable", {"DepartedTime", "ZipCode"}, {"PrevDepTime", "PrevZip"}),
// Flag the stop as "Late" when the previous stop is in the same zipcode and the gap exceeds one hour
Add_OkLate = Table.AddColumn(ExpandTable, "OK_or_Late", each
    if [PrevZip] <> null and [PrevZip] = [ZipCode] and Number.From([ArrivedTime]) - Number.From([PrevDepTime]) > Number.From(#time(1,0,0)) then "Late"
    else "OK", type text),
FinalizeTable = Table.SelectColumns(Add_OkLate, {"RowNumber", "RouteID", "Sequence", "ArrivedTime", "DepartedTime", "ZipCode", "OK_or_Late"})
in
    FinalizeTable
I'm looking to use Kinesis Data Analytics (or some other AWS managed service) to batch records based on a filter criteria. The idea would be that as records come in, we'd start a session window and batch any matching records for 15 min.
The stagger window is exactly what we'd like except we're not looking to aggregate the data, but rather just return the records all together.
Ideally...
100 records spread over 15 min. (20 matching criteria) with first one at 10:02
|
v
At 10:17, the 20 matching records would be sent to the destination
I've tried doing something like:
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
"device_id" INTEGER,
"child_id" INTEGER,
"domain" VARCHAR(32),
"category_id" INTEGER,
"posted_at" DOUBLE,
"block" TIMESTAMP
);
-- Create pump to insert into output
CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
-- Select all columns from source stream
SELECT STREAM
"device_id",
"child_id",
"domain",
"category_id",
"posted_at",
FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE) as block
FROM "SOURCE_SQL_STREAM_001"
WHERE "category_id" = 888815186
WINDOWED BY STAGGER (
PARTITION BY "child_id", FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE)
RANGE INTERVAL '15' MINUTE);
I continue to get errors for all the columns not in the aggregation:
From line 6, column 5 to line 6, column 12: Expression 'domain' is not being used in PARTITION BY sub clause of WINDOWED BY clause
Kinesis Firehose was a suggested solution, but it's a blind window across all child_id values, so it could possibly cut a session up into multiple pieces, and that's what I'm trying to avoid.
Any suggestions? Feels like this might not be the right tool.
Try LAST_VALUE("domain") AS domain in the SELECT clause.
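Applied to the query above, that suggestion would mean wrapping every column that is not part of the PARTITION BY the same way. This is only a sketch: whether LAST_VALUE is accepted for each of these columns depends on the Kinesis Analytics SQL engine, and note it emits one representative row per child_id per window rather than every matching record.
CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM
    LAST_VALUE("device_id") AS "device_id",
    "child_id",
    LAST_VALUE("domain") AS "domain",
    LAST_VALUE("category_id") AS "category_id",
    LAST_VALUE("posted_at") AS "posted_at",
    FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE) as block
FROM "SOURCE_SQL_STREAM_001"
WHERE "category_id" = 888815186
WINDOWED BY STAGGER (
    PARTITION BY "child_id", FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE)
    RANGE INTERVAL '15' MINUTE);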