I have a simple Azure Stream Analytics (ASA) query that reads from an IoT Hub input and sends an average calculation to Power BI every second. I can see that the first data arrives in Power BI 15-20 seconds after IoT Hub receives the input.
Is there anything I can do to decrease this delay?
Query:
SELECT AVG(CAST(acctotal as float)) as average_shake,
CAST(MAX(eventTime) as datetime) as time
INTO powerbioutput
FROM iothubinput
TIMESTAMP BY eventTime
GROUP BY TumblingWindow(second, 1)
Event Ordering settings are kept to default values
Late arrival Days:00, Hours:00, Minutes:00, Seconds:05
Out of order Minutes:00, Seconds:00
Action: Adjust
If you use the system timestamp instead of event time, I think you will see the delay go away. Try just removing the line "TIMESTAMP BY eventTime"
You can get system time - i.e. the timestamp given to the event as it flows through ASA - through:
SELECT System.Timestamp
as documented on MSDN.
Building on Josh's response: perhaps you could try something like:
SELECT AVG(CAST(acctotal as float)) as average_shake,
System.Timestamp as time
INTO powerbioutput
FROM iothubinput
GROUP BY TumblingWindow(second, 1)
What is the volume of your input events and what is the number of IoT Hub partitions? ASA merges data from IoT Hub partitions and arranges events by time to compute the aggregation defined in the query. If you have many partitions and a relatively small number of events, there could be additional delays, as some IoT Hub partitions may not have data and ASA will be waiting for the data to appear (the maximum delay is controlled by the late arrival policy).
If this is the case, you may want to use fewer IoT Hub partitions.
In general, you will see smaller latency in ASA when you process partitions in parallel (use the PARTITION BY clause). The drawback is that you will end up with partial aggregate values per partition. You can probably aggregate them further in Power BI.
I have two tables to join; it's a left join. Below are the two cases showing how my pipeline works.
The job runs in batch mode, it is all user data, and we want to process it in Google Dataflow.
Day 1:
Table A: 5000000 Records. (Size 3TB)
Table B: 200 Records. (Size 1GB)
Both tables were joined via a side input, where Table B's data was taken as the side input, and it was working fine.
Day 2:
Table A: 5000010 Records. (Size 3.001TB)
Table B: 20000 Records. (Size 100GB)
On the second day my pipeline slowed down because the side input uses a cache, and my cache was exhausted because the size of Table B increased.
So I tried using CoGroupByKey, but Day 1 data processing was pretty slow, with a log message about having 10000-plus values on a single key.
So is there a better-performing way to do the join when hot keys are introduced?
It is true that the performance can drop precipitously once table B no longer fits into cache, and there aren't many good solutions. The slowdown in using CoGroupByKey is not solely due to having many values on a single key, but also the fact that you're now shuffling (aka grouping) Table A at all (which was avoided when using a side input).
Depending on the distribution of your keys, one possible mitigation could be to route your hot keys into a path that does the side-input join as before, and your long-tail keys into a CoGroupByKey. This could be done by producing a truncated TableB' as a side input; your ParDo would attempt to look up the key, emitting to one PCollection if it was found in TableB' and to another if it was not [1]. One would then pass this second PCollection to a CoGroupByKey with all of TableB, and flatten the results (see the Python sketch after the link below).
[1] https://beam.apache.org/documentation/programming-guide/#additional-outputs
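For illustration only, here is a rough sketch of that split using the Beam Python SDK (the original pipeline may well be in Java; the PCollection names table_a, table_b and hot_b, and the KV shapes, are assumptions, and the truncated TableB' is assumed to have unique keys so it can be passed via AsDict):
import apache_beam as beam
from apache_beam import pvalue


class SplitByHotKey(beam.DoFn):
    """Joins against the truncated TableB' side input when the key is hot,
    otherwise routes the element to a secondary output for a CoGroupByKey."""

    OUTPUT_COLD = 'cold'

    def process(self, element, hot_b):
        key, a_value = element
        if key in hot_b:
            # Hot key: join directly against the in-memory side-input dict.
            yield (key, (a_value, hot_b[key]))
        else:
            # Long-tail key: defer to the CoGroupByKey path below.
            yield pvalue.TaggedOutput(self.OUTPUT_COLD, (key, a_value))


with beam.Pipeline() as p:
    # Placeholder inputs; in practice these would be your TableA / TableB reads.
    table_a = p | 'ReadA' >> beam.Create([('k1', 'a1'), ('k2', 'a2')])
    table_b = p | 'ReadB' >> beam.Create([('k1', 'b1'), ('k3', 'b3')])
    hot_b = p | 'ReadHotB' >> beam.Create([('k1', 'b1')])  # truncated TableB'

    split = (table_a
             | 'Split' >> beam.ParDo(
                 SplitByHotKey(),
                 hot_b=pvalue.AsDict(hot_b)).with_outputs(
                     SplitByHotKey.OUTPUT_COLD, main='hot'))

    hot_joined = split['hot']

    # Shuffle only the long-tail keys against the full TableB (left join).
    cold_joined = (
        {'a': split[SplitByHotKey.OUTPUT_COLD], 'b': table_b}
        | 'CoGroup' >> beam.CoGroupByKey()
        | 'Pairs' >> beam.FlatMap(
            lambda kv: [(kv[0], (a, b))
                        for a in kv[1]['a']
                        for b in (list(kv[1]['b']) or [None])]))

    joined = (hot_joined, cold_joined) | 'FlattenJoin' >> beam.Flatten()
The idea is that the hot path never shuffles Table A at all, while only the long-tail remainder pays the cost of the CoGroupByKey.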
I am having real problems getting the AWS IoT Analytics Delta Window (docs) to work.
I am trying to set it up so that every day a query is run to get the last 1 hour of data only. According to the docs the schedule feature can be used to run the query using a cron expression (in my case every hour) and the delta window should restrict my query to only include records that are in the specified time window (in my case the last hour).
The SQL query I am running is simply SELECT * FROM dev_iot_analytics_datastore and if I don't include any delta window I get the records as expected. Unfortunately when I include a delta expression I get nothing (ever). I left the data accumulating for about 10 days now so there are a couple of million records in the database. Given that I was unsure what the optimal format would be I have included the following temporal fields in the entries:
datetime : 2019-05-15T01:29:26.509
(A string formatted using ISO Local Date Time)
timestamp_sec : 1557883766
(A unix epoch expressed in seconds)
timestamp_milli : 1557883766509
(A unix epoch expressed in milliseconds)
There is also a value automatically added by AWS called __dt which uses the same format as my datetime, except it seems to be accurate only to within 1 day, i.e. all values entered within a given day have the same value (e.g. 2019-05-15 00:00:00.00).
I have tried a range of expressions (including the suggested AWS expression) from both standard SQL and Presto as I'm not sure which one is being used for this query. I know they use a subset of Presto for the analytics so it makes sense that they would use it for the delta but the docs simply say '... any valid SQL expression'.
Expressions I have tried so far with no luck:
from_unixtime(timestamp_sec)
from_unixtime(timestamp_milli)
cast(from_unixtime(unixtime_sec) as date)
cast(from_unixtime(unixtime_milli) as date)
date_format(from_unixtime(timestamp_sec), '%Y-%m-%dT%h:%i:%s')
date_format(from_unixtime(timestamp_milli), '%Y-%m-%dT%h:%i:%s')
from_iso8601_timestamp(datetime)
What are the offset and time expression parameters that you are using?
Since delta windows are effectively filters inserted into your SQL, you can troubleshoot them by manually inserting the filter expression into your data set's query.
Namely, applying a delta window filter with -3 minute (negative) offset and 'from_unixtime(my_timestamp)' time expression to a 'SELECT my_field FROM my_datastore' query translates to an equivalent query:
SELECT my_field FROM
(SELECT * FROM "my_datastore" WHERE
(__dt between date_trunc('day', iota_latest_succeeded_schedule_time() - interval '1' day)
and date_trunc('day', iota_current_schedule_time() + interval '1' day)) AND
iota_latest_succeeded_schedule_time() - interval '3' minute < from_unixtime(my_timestamp) AND
from_unixtime(my_timestamp) <= iota_current_schedule_time() - interval '3' minute)
Try using a similar query (with no delta time filter) with the correct values for offset and time expression and see what you get. The (__dt between ...) clause is just an optimization for limiting the scanned partitions; you can remove it for the purposes of troubleshooting.
Please try the following:
Set query to SELECT * FROM dev_iot_analytics_datastore
Data selection filter:
Data selection window: Delta time
Offset: -1 Hours
Timestamp expression: from_unixtime(timestamp_sec)
Wait for dataset content to run for a bit, say 15 minutes or more.
Check contents
After several weeks of testing and trying all the suggestions in this post, along with many more, it appears that the extremely technical answer was to 'switch it off and back on'. I deleted the whole analytics stack, rebuilt everything with different names, and it now seems to be working!
It's important to note that even though I have flagged this as the correct answer, since it was the actual resolution, both the answers provided by #Populus and #Roger would have been correct had my deployment been functioning as expected.
I found by chance that changing SELECT * FROM datastore to SELECT id1, id2, ... FROM datastore solved the problem.
I am working with Kinesis Analytics and I am trying to understand how to write my application to give me a sliding window over 24 hours. What I have generates the right data, but it looks like it regenerates it every time, which might be what it's supposed to do, and perhaps my own ignorance is preventing me from looking at the problem the right way?
What I want to do:
I have a few devices that feed a Kinesis Stream, which this Kinesis analytics application is hooked up to.
Now, when a record comes in, what I want to do is SUM a value over the last 24 hours and store that. So after Kinesis Analytics does its job, I'm connecting it to a Lambda to finalize some things.
My issue is that when I simulate sending in some data, 5 records in this case, everything runs, but it runs multiple times, not 5. It LOOKS like each time a record comes in, it redoes everything in the window (expected), which triggers the Lambda for each row that's emitted. As the table grows, that's bad news. What I really want is just the latest value from the window from NOW - 24 HOUR, with the "id" field so I can join that "id" back to a record stored elsewhere.
My Application looks like this:
CREATE OR REPLACE STREAM "DEVICE_STREAM" (
"id" VARCHAR(64),
"timestamp_mark" TIMESTAMP,
"device_id" VARCHAR(64),
"property_a_id" VARCHAR(64),
"property_b_id" VARCHAR(64),
"value" DECIMAL
);
CREATE OR REPLACE PUMP "DEVICE_PUMP" AS
INSERT INTO "DEVICE_STREAM"
SELECT STREAM "id",
"timestamp_mark",
"device_id",
"x_id",
"y_id",
SUM("value") OVER W1 AS "value",
FROM "SOURCE_SQL_STREAM_001"
WINDOW W1 AS (
PARTITION BY "device_id", "property_a_id", "property_b_id" ORDER BY "SOURCE_SQL_STREAM_001".ROWTIME
RANGE INTERVAL '24' HOUR PRECEDING
);
Hmmm... this might be a better idea: do the aggregation in a sub-select and select from that. It looks like I need that second window (W2 below) to ensure I get each record that was given back out.
CREATE OR REPLACE STREAM "DEVICE_STREAM" (
"id" VARCHAR(64),
"timestamp_mark" TIMESTAMP,
"device_id" VARCHAR(64),
"property_a_id" VARCHAR(64),
"property_b_id" VARCHAR(64),
"value" DECIMAL
);
CREATE OR REPLACE PUMP "DEVICE_PUMP" AS
INSERT INTO "DEVICE_STREAM"
SELECT STREAM s."id",
s."timestamp_mark",
s."device_id",
s."property_a_id",
s."property_b_id",
v."value"
FROM "SOURCE_SQL_STREAM_001" OVER W2 AS s, (
SELECT STREAM "SOURCE_SQL_STREAM_001"."ROWTIME", "id",
"timestamp_mark",
"device_id",
"property_a_id",
"property_b_id",
SUM("value") OVER W1 AS "value",
FROM "SOURCE_SQL_STREAM_001"
WINDOW W1 AS (
PARTITION BY "device_id", "property_a_id", "property_b_id" ORDER BY "SOURCE_SQL_STREAM_001".ROWTIME
RANGE INTERVAL '24' HOUR PRECEDING
)
) AS v
WHERE s."id" = v."id"
WINDOW W2 AS (
RANGE INTERVAL '1' SECOND PRECEDING
);
Also I notice that if I restart the Kinesis Analytics application, the SUM values reset, so clearly it doesn't persist across restarts, which might make it unsuitable for this solution. I might have to just set up a SQL server and periodically delete old records.
In general, using Streaming Analytics solutions (and Kinesis Analytics in particular) is recommended when you need to do something based on the data in the events and not on something external like wall-clock time.
The reason is simple: if you need to do something once every 24h, you create a job that brings the data from storage (a DB) once, performs your task, and then "goes to sleep" for another 24h - no complexities, manageable overhead. Now, if you need to do something based on the data (e.g. when the SUM of some field across multiple events exceeds X), you are in trouble with a conventional solution, since there is no simple criterion for when it should run. If you run it periodically, it might be invoked many times until the data-driven criterion is met, creating clear overhead.
In the latter case, a Streaming Analytics solution will be used as designed and will trigger your logic only when needed, minimizing the overhead.
If you prefer using Streaming Analytics (which I personally don't recommend based on the description of your problem), but are struggling with Kinesis Analytics syntax, you might consider using Drools Kinesis Analytics. Among its features are crons and collectors, which provide you with a very simple way to trigger jobs on a time basis.
Note that my answer is biased, since I'm a CTO at Streamx.
The requirement is to kick off a DAG based on data availability from upstream/dependent tables.
A while-style condition should check data availability (in the tables in BigQuery, for n iterations) to determine whether data is available or not. If data is available, then kick off the subdag/task; else continue looping.
It would be great to see a clear example of how to use BigQueryOperator or BigQueryValueCheckOperator and then execute a BigQuery query, something like this:
SELECT 1
FROM
WHERE datetime BETWEEN TIMESTAMP(CURRENT_DATE())
  AND TIMESTAMP(DATE_ADD(CURRENT_DATE(), 1, 'day'))
LIMIT 1
If the query output is 1 (meaning data is available for today's load), then kick off the DAG; else continue in the loop, as shown in the attached diagram link.
Has anyone set up such a design in an Airflow DAG?
You may check the BaseSensorOperator and BigQueryTableSensor to implement your own Sensor for it. https://airflow.incubator.apache.org/_modules/airflow/operators/sensors.html
Sensor operators keep executing at a time interval and succeed when a
criteria is met and fail if and when they time out.
BigQueryTableSensor just checks whether the table exists or not, but does not check the data in the table. It might be something like this:
task1>>YourSensor>>task2
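As an illustration only, here is a rough sketch of what such a sensor might look like (the class name BigQueryDataAvailableSensor, the connection id, and the query are hypothetical, and the exact import paths depend on your Airflow version):
from airflow.contrib.hooks.bigquery_hook import BigQueryHook
from airflow.operators.sensors import BaseSensorOperator
from airflow.utils.decorators import apply_defaults


class BigQueryDataAvailableSensor(BaseSensorOperator):
    """Pokes BigQuery until the given SQL returns at least one row."""

    @apply_defaults
    def __init__(self, sql, bigquery_conn_id='bigquery_default', *args, **kwargs):
        super(BigQueryDataAvailableSensor, self).__init__(*args, **kwargs)
        self.sql = sql
        self.bigquery_conn_id = bigquery_conn_id

    def poke(self, context):
        hook = BigQueryHook(bigquery_conn_id=self.bigquery_conn_id)
        # The sensor succeeds (returns True) as soon as the query yields a row;
        # otherwise it is retried every poke_interval seconds until it times out.
        records = hook.get_pandas_df(self.sql)
        return len(records) > 0


# Hypothetical wiring in the DAG, mirroring task1 >> YourSensor >> task2:
# wait_for_data = BigQueryDataAvailableSensor(
#     task_id='wait_for_data', sql=YOUR_SQL,
#     poke_interval=600, timeout=6 * 3600, dag=dag)
# task1 >> wait_for_data >> task2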
I have a project that uses an event hub to receive data, which is sent every second. The data is received by a website using SignalR, and this is all working fine. I have been storing the data in blob storage via a Stream Analytics job, but this is really slow to access, and with the amount of data I am receiving from just 6 devices, it will get even slower as the number of devices increases. I need to access the data to display historical data via graphs on the website, which is then topped up with the live data coming in.
I don't really need to store the data every second, so I thought about only storing it every 30 seconds instead, but into a SQL DB. What I am trying to do is still receive the data every second but only store it every 30. I have tried a tumbling window, but from what I can see, this just dumps everything every 30 seconds instead of the single entries.
Am I misunderstanding the Tumbling, Sliding and Hopping windows? I am guessing I cannot use them in this way? If that is the case, I am guessing the only way to do it would be to have the output DB as an input, so I can cross-reference the timestamp with the current time?
Unless anyone has any other ideas? Any help would be appreciated.
Thanks
Am I misunderstanding the Tumbling, Sliding and Hopping windows?
You are correct that this will put all events within the Tumbling/Sliding/Hopping window together. However, this is only valid within a GROUP BY clause, which requires an aggregate function over the group.
There is an aggregate function, Collect(), which will create an array of the events within a group.
I think this should be possible when you group every event within a 30 second tumbling window using Collect(), then in the next step, CROSS APPLY each record, which should output all received events within the 30 seconds.
With Grouper AS (
SELECT Collect() AS records
FROM Input TIMESTAMP BY time
GROUP BY TumblingWindow(second, 30)
)
SELECT
record.ArrayValue.FieldA AS FieldA,
record.ArrayValue.FieldB AS FieldB
INTO Output
FROM Grouper
CROSS APPLY GetArrayElements(Grouper.records) AS record
If you are trying to aggregate 30 entries into one summary row every 30 seconds then a tumbling window is a good choice. Something like the following should work:
SELECT System.TimeStamp AS OutTime, TollId, COUNT(*) as cnt, sum(TollCharge) as TollCharge
FROM Input TIMESTAMP BY EntryTime
GROUP BY TollId, TumblingWindow(second, 30)
Thanks for the response. I have been speaking to my contact at Microsoft and he suggested something similar; I had also found something like that in various examples online. What I actually want to do is only update the database with the data every 30 seconds, so I will receive the event, store it, and not store it again until 30 seconds have passed. I am not sure how I can do that with an ASA job, to be honest, as I need to have a record of the last time it was updated. I actually have a connection to the event hub from my website, so in the receiver I am going to perform a simple check and then store the data from there.