I am working with Kinesis Analytics and trying to understand how to write my application to give me a sliding window over 24 hours. What I have generates the right data, but it looks like it regenerates it every time. That might be what it's supposed to do, and my own ignorance may be preventing me from looking at the problem the right way.
What I want to do:
I have a few devices that feed a Kinesis Stream, which this Kinesis analytics application is hooked up to.
Now, when a record comes in, I want to SUM a value over the last 24 hours and store that. After Kinesis Analytics does its job, I'm connecting it to a Lambda to finalize some things.
My issue is, when I simulate sending in some data (5 records in this case), everything runs, but it runs multiple times, not 5. It LOOKS like each time a record comes in, everything in the window is redone (expected), which triggers the Lambda for each row that's emitted. As the table grows, that's bad news. What I really want is just the latest value from the window from NOW - 24 HOUR, with the "id" field so I can join that "id" back to a record stored elsewhere.
My Application looks like this:
CREATE OR REPLACE STREAM "DEVICE_STREAM" (
"id" VARCHAR(64),
"timestamp_mark" TIMESTAMP,
"device_id" VARCHAR(64),
"property_a_id" VARCHAR(64),
"property_b_id" VARCHAR(64),
"value" DECIMAL
);
CREATE OR REPLACE PUMP "DEVICE_PUMP" AS
INSERT INTO "DEVICE_STREAM"
SELECT STREAM "id",
"timestamp_mark",
"device_id",
"x_id",
"y_id",
SUM("value") OVER W1 AS "value",
FROM "SOURCE_SQL_STREAM_001"
WINDOW W1 AS (
PARTITION BY "device_id", "property_a_id", "property_b_id" ORDER BY "SOURCE_SQL_STREAM_001".ROWTIME
RANGE INTERVAL '24' HOUR PRECEDING
);
Hmm, this might be a better idea: do the aggregation in a sub-select and select from that. It looks like I need that second window (W2 below) to ensure I get each record that was given back out.
CREATE OR REPLACE STREAM "DEVICE_STREAM" (
"id" VARCHAR(64),
"timestamp_mark" TIMESTAMP,
"device_id" VARCHAR(64),
"property_a_id" VARCHAR(64),
"property_b_id" VARCHAR(64),
"value" DECIMAL
);
CREATE OR REPLACE PUMP "DEVICE_PUMP" AS
INSERT INTO "DEVICE_STREAM"
SELECT STREAM s."id",
s."timestamp_mark",
s."device_id",
s."property_a_id",
s."property_b_id",
v."value"
FROM "SOURCE_SQL_STREAM_001" OVER W2 AS s, (
SELECT STREAM "SOURCE_SQL_STREAM_001"."ROWTIME", "id",
"timestamp_mark",
"device_id",
"property_a_id",
"property_b_id",
SUM("value") OVER W1 AS "value"
FROM "SOURCE_SQL_STREAM_001"
WINDOW W1 AS (
PARTITION BY "device_id", "property_a_id", "property_b_id" ORDER BY "SOURCE_SQL_STREAM_001".ROWTIME
RANGE INTERVAL '24' HOUR PRECEDING
)
) AS v
WHERE s."id" = v."id"
WINDOW W2 AS (
RANGE INTERVAL '1' SECOND PRECEDING
);
Also, I notice that if I restart the Kinesis Analytics application, the SUM values reset, so clearly it doesn't persist across restarts, which might make it unsuitable for this solution. I might have to just set up a SQL server and periodically delete old records.
In general, using Streaming Analytics solutions (and Kinesis Analytics in particular) is recommended when you need to act on the data in the events themselves, not on something external like wall-clock time.
The reason is simple: if you need to do something once every 24h, you create a job that brings the data from storage (a DB) once, performs your task, and then "goes to sleep" for another 24h. No complexities, manageable overhead. Now, if you need to do something based on the data (e.g. when the SUM of some field across multiple events exceeds X), you are in trouble with a conventional solution, since there is no simple criterion for when it should run. If you run it periodically, it might be invoked many times before the data-driven criterion is met, creating clear overhead.
In the latter case, a Streaming Analytics solution will be used as designed and will trigger your logic just when needed, minimizing the overhead.
If you prefer using Streaming Analytics (which I personally don't recommend based on the description of your problem) but are struggling with Kinesis Analytics syntax, you might consider using Drools Kinesis Analytics. Among its features are crons and collectors, which give you a very simple way to trigger jobs on a time basis.
Note that my answer is biased, since I'm a CTO at Streamx.
Related
After a couple of weeks working with Kinesis Analytics I have a bunch of questions:
Does it have something like watermarks, like Flink does?
From what I found, even stagger windows are emitted based on real time:
The records are grouped by one-minute stagger windows. The stagger
window starts when the application receives the first AMZN record
(with a ROWTIME of 11:00:20). When the 1-minute stagger window expires
(at 11:00:20) {I guess that is a typo; probably it should be 11:01:20},
a record with the results that fall within the stagger window (based
on ROWTIME and EVENT_TIME) is written to the output stream. Using a
stagger window, all of the records with a ROWTIME and EVENT_TIME
within a one-minute window are emitted in a single result. (taken
from here)
On the page about sliding windows I found some examples, and since you can only choose a field for "partition by" there, I can assume that the only field on which windowing is based is ROWTIME.
WINDOW W1 AS (
PARTITION BY ticker_symbol
RANGE INTERVAL '1' MINUTE PRECEDING);
Please don't consider this a post of hate; I use AWS a lot and most of the services are reliable and user friendly. It is more an attempt to point out some features which may be needed by more users (not only me).
Or, if you know how this can already be done in Kinesis Analytics, it would be interesting to know, as I did not find anything about these features in the documentation.
From what I've read in their documentation here and here
For a time-based windowed query, you specify the window size in terms of time (for example, a one-minute window). This requires a timestamp column in your in-application stream that is monotonically increasing. (The timestamp for a new row is greater than or equal to the previous row.) Amazon Kinesis Data Analytics provides such a timestamp column called ROWTIME for each in-application stream. You can use this column when specifying time-based queries. For your application, you might choose some other timestamp option. For more information, see Timestamps and the ROWTIME Column.
you can have event-time windows, but the timestamps have to be monotonically increasing, which suggests that there is only simple watermarking. A search for "watermark" does find some hits on this page.
So I've been trying to stream data from Google Search Console API to BigQuery in real time.
The data are retrieved from GSC API and streamed to the BigQuery stream buffer. However, I experience high latency before the streaming buffer can be flushed (up to 2 hours or more). So, the data stays in the streaming buffer but is not in the table.
The data are also not visible in the preview, and the table size is 0B with 0 rows (actually, after waiting for more than a day, I still see 0B even though there are more than 0 rows).
Another issue is that, some time after the data is stored in the table (table size and number of rows are correct), it simply disappears from it and appears in the streaming buffer (I only saw this once). -> This was explained by the second bullet in shollyman's answer.
What I want is to have the data in the table in real time. According to the documentation this seems possible but doesn't work in my case (2h of delay as stated above).
Here's the code responsible for that part:
import uuid
from googleapiclient.discovery import build

for row in response['rows']:
    keys = ','.join(row['keys'])
    # Stream one row at a time to BigQuery via tabledata.insertAll
    row_to_stream = {'keys': keys, 'f1': row['f1'], 'f2': row['f2'],
                     'ctr': row['ctr'], 'position': row['position']}
    insert_all_data = {
        "kind": "bigquery#tableDataInsertAllRequest",
        "skipInvalidRows": True,
        "ignoreUnknownValues": True,
        'rows': [{
            'insertId': str(uuid.uuid4()),
            'json': row_to_stream,
        }]
    }
    build('bigquery', 'v2', cache_discovery=False).tabledata().insertAll(
        projectId=projectid,
        datasetId=dataset_id,
        tableId=tableid,
        body=insert_all_data).execute(num_retries=5)
I've seen questions that seem very similar to mine on here but I haven't really found an answer. I therefore have 2 questions.
1. What could cause this issue?
Also, I'm new to GCP and I've seen other options (at least they seemed like options to me) for real time streaming of data to BigQuery (e.g., using PubSub and a few projects around real time Twitter data analysis).
2. How do you pick the best option for a particular task?
By default, the BigQuery web UI doesn't automatically refresh the state of a table. There is a Refresh button when you click into the details of a table, that should show you the updated size information for both managed storage and the streaming buffer (displayed below the main table details). Rows in the buffer are available to queries, but the preview button may not show results until some data is extracted from the streaming buffer to managed storage.
I suspect the case where you observed data disappearing from managed storage and appearing back in the streaming buffer may have been a case where the table was deleted and recreated with the same name, or was truncated in some fashion and streaming restarted. Data doesn't transition from managed storage back to the buffer.
Deciding what technology to use for streaming depends on your needs. Pub/Sub is a great choice when you have multiple consumers of the information (multiple pub/sub subscribers consuming the same stream of messages independently), or you need to apply additional transformations of the data between the producer and consumer. To get the data from pub/sub to BigQuery, you'll still need a subscriber to write the messages into BigQuery, as the two have no direct integration.
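To make that last point concrete, here is a minimal sketch of the batching logic such a subscriber would need. The transport and the BigQuery write are deliberately stubbed out (the `write_fn` callback stands in for a real `insert_rows_json` call, and the message shape is invented), so only the buffering pattern itself is shown:

```python
# Sketch: a Pub/Sub-style consumer that batches messages before writing
# to BigQuery. The transport is stubbed; only the batching logic is real.

class BatchingWriter:
    def __init__(self, write_fn, batch_size=500):
        self.write_fn = write_fn      # stand-in for a BigQuery insert call
        self.batch_size = batch_size
        self.buffer = []

    def on_message(self, row):
        # Called once per Pub/Sub message; flush when the batch fills up.
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.write_fn(self.buffer)  # one insert request per batch
            self.buffer = []

written = []
writer = BatchingWriter(written.append, batch_size=2)
for i in range(5):
    writer.on_message({'n': i})
writer.flush()  # don't forget the tail of the stream
```

Batching like this also matters for cost and quota: one insertAll call per row (as in the snippet above) is far less efficient than one call per few hundred rows.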
I have a simple query in ASA from an IoT Hub input to send an average calculation each second to powerbi. I can see that the first data comes to PowerBi 15-20 seconds after IoT Hub receives the input.
Is there anything I can do to decrease this delay?
Query:
SELECT AVG(CAST(acctotal as float)) as average_shake,
CAST(MAX(eventTime) as datetime) as time
INTO powerbioutput
FROM iothubinput
TIMESTAMP BY eventTime
GROUP BY TumblingWindow(second, 1)
Event Ordering settings are kept to default values
Late arrival Days:00, Hours:00, Minutes:00, Seconds:05
Out of order Minutes:00, Seconds:00
Action: Adjust
If you use the system timestamp instead of event time, I think you will see the delay go away. Try just removing the line "TIMESTAMP BY eventTime"
You can get system time - i.e. the timestamp given to the event as it flows through ASA - through:
SELECT System.Timestamp
As documented in MSDN.
Building onto Josh's response: perhaps you could try something like:
SELECT AVG(CAST(acctotal as float)) as average_shake,
System.Timestamp as time
INTO powerbioutput
FROM iothubinput
TIMESTAMP BY time
GROUP BY TumblingWindow(second, 1)
What is the volume of your input events, and what is the number of IoT Hub partitions? ASA merges data from IoT Hub partitions and arranges events by time to compute the aggregation defined in the query. If you have many partitions and a relatively small number of events, there could be additional delays, as some IoT Hub partitions may not have data and ASA will be waiting for the data to appear (the max delay is controlled by the late arrival policy).
If this is the case, you may want to use fewer IoTHub partitions.
In general, you will see smaller latency in ASA when you process partitions in parallel (use PARTITION BY clause). The drawback is that you will end up with partial aggregate values per partition. You can probably aggregate them further in PowerBI.
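One caveat with combining those partial aggregates downstream: you cannot just average the per-partition averages, because partitions with fewer events would be over-weighted. Emit a sum and a count per partition and combine those instead. A quick Python illustration of the arithmetic (the partition data is made up):

```python
# Combining per-partition partial aggregates into a correct global average.
# Averaging the averages directly would over-weight small partitions.

partials = [
    {'partition': 0, 'sum': 10.0, 'count': 4},   # partition avg 2.5
    {'partition': 1, 'sum': 3.0,  'count': 1},   # partition avg 3.0
]

total_sum = sum(p['sum'] for p in partials)      # 13.0
total_count = sum(p['count'] for p in partials)  # 5
global_avg = total_sum / total_count             # 2.6, not (2.5 + 3.0) / 2
```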
I have a project that uses an event hub to receive data, which is sent every second. The data is received by a website using SignalR, and this is all working fine. I have been storing the data in blob storage via a Stream Analytics job, but this is really slow to access, and with the amount of data I am receiving off just 6 devices, it will get even slower as this increases. I need to access the data to display historical data via graphs on the website, which is then topped up with the live data coming in.
I don't really need to store the data every second, so I thought about only storing it every 30 seconds instead, but into a SQL DB. What I am trying to do is still receive the data every second but only store it every 30. I have tried a tumbling window, but from what I can see, this just dumps everything every 30 seconds instead of a single entry.
Am I misunderstanding the tumbling, sliding and hopping windows? I am guessing I cannot use them in this way? If that is the case, I am guessing the only way to do it would be to have the output DB as an input, so I can cross-reference the timestamp with the current time?
Unless anyone has any other ideas? Any help would be appreciated.
Thanks
Am I misunderstanding the tumbling, sliding and hopping windows?
You are correct that this will put all events within the tumbling/sliding/hopping window together. However, this only applies within a GROUP BY, which requires an aggregate function over the group.
There is an aggregate function Collect() which creates an array of the events within a group.
So this should be possible: group every event within a 30-second tumbling window using Collect(), then in the next step CROSS APPLY each record, which should output all received events within the 30 seconds.
With Grouper AS (
SELECT Collect() AS records
FROM Input TIMESTAMP BY time
GROUP BY TumblingWindow(second, 30)
)
SELECT
record.ArrayValue.FieldA AS FieldA,
record.ArrayValue.FieldB AS FieldB
INTO Output
FROM Grouper
CROSS APPLY GetArrayElements(Grouper.records) AS record
If you are trying to aggregate 30 entries into one summary row every 30 seconds then a tumbling window is a good choice. Something like the following should work:
SELECT System.TimeStamp AS OutTime, TollId, COUNT(*) as cnt, sum(TollCharge) as TollCharge
FROM Input TIMESTAMP BY EntryTime
GROUP BY TollId, TumblingWindow(second, 30)
Thanks for the response. I have been speaking to my contact at Microsoft and he suggested something similar; I had also found something like that in various examples online. What I actually want to do is only update the database every 30 seconds: I will receive the event, store it, and not store it again until 30 seconds have passed. I am not sure how I can do that with an ASA job, to be honest, as I need a record of the last time it was updated. I actually have a connection to the event hub from my website, so in the receiver I am going to perform a simple check and then store the data from there.
Long story short, I'm rewriting a piece of a system and am looking for a way to store some hit counters in AWS SimpleDB.
For those of you not familiar with SimpleDB, the (main) problem with storing counters is that the cloud propagation delay is often over a second. Our application currently gets ~1,500 hits per second. Not all those hits will map to the same key, but a ballpark figure might be around 5-10 updates to a key every second. This means that if we were to use a traditional update mechanism (read, increment, store), we would end up inadvertently dropping a significant number of hits.
One potential solution is to keep the counters in memcache, and using a cron task to push the data. The big problem with this is that it isn't the "right" way to do it. Memcache shouldn't really be used for persistent storage... after all, it's a caching layer. In addition, then we'll end up with issues when we do the push, making sure we delete the correct elements, and hoping that there is no contention for them as we're deleting them (which is very likely).
Another potential solution is to keep a local SQL database and write the counters there, updating our SimpleDB out-of-band every so many requests or running a cron task to push the data. This solves the syncing problem, as we can include timestamps to easily set boundaries for the SimpleDB pushes. Of course, there are still other issues, and though this might work with a decent amount of hacking, it doesn't seem like the most elegant solution.
Has anyone encountered a similar issue in their experience, or have any novel approaches? Any advice or ideas would be appreciated, even if they're not completely fleshed out. I've been thinking about this one for a while and could use some new perspectives.
The existing SimpleDB API does not lend itself naturally to being a distributed counter. But it certainly can be done.
Working strictly within SimpleDB, there are 2 ways to make it work: an easy method that requires something like a cron job to clean up, or a much more complex technique that cleans as it goes.
The Easy Way
The easy way is to make a different item for each "hit", with a single attribute which is the key. Pump the domain(s) with counts quickly and easily. When you need to fetch the count (presumably much less often), you have to issue a query:
SELECT count(*) FROM domain WHERE key='myKey'
Of course this will cause your domain(s) to grow unbounded and the queries will take longer and longer to execute over time. The solution is a summary record where you roll up all the counts collected so far for each key. It's just an item with attributes for the key {summary='myKey'} and a "Last-Updated" timestamp with granularity down to the millisecond. This also requires that you add the "timestamp" attribute to your "hit" items. The summary records don't need to be in the same domain. In fact, depending on your setup, they might best be kept in a separate domain. Either way you can use the key as the itemName and use GetAttributes instead of doing a SELECT.
Now getting the count is a two step process. You have to pull the summary record and also query for 'Timestamp' strictly greater than whatever the 'Last-Updated' time is in your summary record and add the two counts together.
SELECT count(*) FROM domain WHERE key='myKey' AND timestamp > '...'
You will also need a way to update your summary record periodically. You can do this on a schedule (every hour) or dynamically based on some other criteria (for example do it during regular processing whenever the query returns more than one page). Just make sure that when you update your summary record you base it on a time that is far enough in the past that you are past the eventual consistency window. 1 minute is more than safe.
This solution works in the face of concurrent updates because even if many summary records are written at the same time, they are all correct and whichever one wins will still be correct because the count and the 'Last-Updated' attribute will be consistent with each other.
This also works well across multiple domains. Even if you keep your summary records with the hit records, you can pull the summary records from all your domains simultaneously and then issue your queries to all domains in parallel. The reason to do this is if you need higher throughput for a key than what you can get from one domain.
This works well with caching: if your cache fails, you have an authoritative backup.
The time will come when someone wants to go back and edit / remove / add a record that has an old 'Timestamp' value. You will have to update your summary record (for that domain) at that time, or your counts will be off until you recompute that summary.
This will give you a count that is in sync with the data currently viewable within the consistency window. This won't give you a count that is accurate up to the millisecond.
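The two-step read described above can be sketched in a few lines of Python. The in-memory `hits` list and `summary` dict stand in for SimpleDB items (GetAttributes for the summary, SELECT count(*) for the delta); only the arithmetic is the point:

```python
# Sketch of the "easy way" count: summary record + delta query.
# hits and summary are in-memory stand-ins for SimpleDB items.

hits = [
    {'key': 'myKey', 'timestamp': 100},   # already rolled into the summary
    {'key': 'myKey', 'timestamp': 205},   # after the summary's Last-Updated
    {'key': 'myKey', 'timestamp': 210},
]
summary = {'key': 'myKey', 'count': 1, 'last_updated': 200}

def get_count(key):
    # Step 1: pull the summary record (GetAttributes in SimpleDB).
    base = summary['count']
    # Step 2: count hits strictly newer than Last-Updated (SELECT count(*)).
    delta = sum(1 for h in hits
                if h['key'] == key and h['timestamp'] > summary['last_updated'])
    return base + delta

total = get_count('myKey')  # 1 rolled up + 2 newer hits = 3
```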
The Hard Way
The other way is to do the normal read - increment - store mechanism, but also write a composite value that includes a version number along with your value, where the version number you use is 1 greater than the version number of the value you are updating.
get(key) returns the attribute value="Ver015 Count089"
Here you retrieve a count of 89 that was stored as version 15. When you do an update you write a value like this:
put(key, value="Ver016 Count090")
The previous value is not removed, and you end up with an audit trail of updates that is reminiscent of Lamport clocks.
This requires you to do a few extra things:
the ability to identify and resolve conflicts whenever you do a GET
a simple version number isn't going to work; you'll want to include a timestamp with resolution down to at least the millisecond, and maybe a process ID as well
in practice you'll want your value to include the current version number and the version number of the value your update is based on, to more easily resolve conflicts
you can't keep an infinite audit trail in one item, so you'll need to issue deletes for older values as you go
What you get with this technique is like a tree of divergent updates: you'll have one value, and then all of a sudden multiple updates will occur, and you will have a bunch of updates based off the same old value, none of which know about each other.
When I say resolve conflicts at GET time I mean that if you read an item and the value looks like this:
      11 --- 12
     /
10 ------ 11
     \
      11
You have to be able to figure out that the real value is 14, which you can do if each new value includes the version of the value(s) it is updating.
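To make the resolution concrete, here is a small Python sketch. The record format is invented for illustration, but the key idea matches the text: each update record stores the version it was based on, and since every update represents exactly one increment, the true count is the common ancestor's value plus one per update record, however the tree diverged:

```python
# Resolving divergent counter updates: every update record represents
# exactly one increment applied to some base version.

records = [
    {'version': 'v10',  'based_on': None,   'count': 10},  # common ancestor
    {'version': 'v11a', 'based_on': 'v10',  'count': 11},
    {'version': 'v11b', 'based_on': 'v10',  'count': 11},
    {'version': 'v11c', 'based_on': 'v10',  'count': 11},
    {'version': 'v12',  'based_on': 'v11a', 'count': 12},
]

base = next(r for r in records if r['based_on'] is None)
increments = sum(1 for r in records if r['based_on'] is not None)
real_value = base['count'] + increments  # 10 + 4 updates = 14
```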
It shouldn't be rocket science
If all you want is a simple counter, this is way overkill. It shouldn't be rocket science to make a simple counter, which is why SimpleDB may not be the best choice for making simple counters.
That isn't the only way, but most of those things will need to be done if you implement a SimpleDB solution in lieu of actually having a lock.
Don't get me wrong, I actually like this method precisely because there is no lock, and the bound on the number of processes that can use this counter simultaneously is around 100 (because of the limit on the number of attributes in an item). And you can get beyond 100 with some changes.
Note
But if all these implementation details were hidden from you and you just had to call increment(key), it wouldn't be complex at all. With SimpleDB the client library is the key to making the complex things simple. But currently there are no publicly available libraries that implement this functionality (to my knowledge).
To anyone revisiting this issue, Amazon just added support for Conditional Puts, which makes implementing a counter much easier.
Now, to implement a counter - simply call GetAttributes, increment the count, and then call PutAttributes, with the Expected Value set correctly. If Amazon responds with an error ConditionalCheckFailed, then retry the whole operation.
Note that you can only have one expected value per PutAttributes call. So, if you want to have multiple counters in a single row, then use a version attribute.
pseudo-code:
begin
  attributes = SimpleDB.GetAttributes
  initial_version = attributes[:version]
  attributes[:counter1] += 3
  attributes[:counter2] += 7
  attributes[:version] += 1
  SimpleDB.PutAttributes(attributes, :expected => {:version => initial_version})
rescue ConditionalCheckFailed
  retry
end
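The same loop in Python, run against a toy in-memory store. The `ConditionalCheckFailed` exception and the store are stand-ins for the real SimpleDB client (where the version check happens server-side and atomically); the retry structure is the point:

```python
# Conditional-put counter increment with retry, against a stubbed store.

class ConditionalCheckFailed(Exception):
    pass

store = {'version': 0, 'counter1': 0, 'counter2': 0}

def put_attributes(attrs, expected_version):
    # The real PutAttributes performs this check server-side, atomically.
    if store['version'] != expected_version:
        raise ConditionalCheckFailed()
    store.update(attrs)

def increment():
    while True:
        attrs = dict(store)                 # GetAttributes
        initial_version = attrs['version']
        attrs['counter1'] += 3
        attrs['counter2'] += 7
        attrs['version'] += 1               # version guards all counters
        try:
            put_attributes(attrs, expected_version=initial_version)
            return
        except ConditionalCheckFailed:
            continue                        # another writer won; redo the read

increment()
increment()
```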
I see you've accepted an answer already, but this might count as a novel approach.
If you're building a web app then you can use Google's Analytics product to track page impressions (if the page to domain-item mapping fits) and then to use the Analytics API to periodically push that data up into the items themselves.
I haven't thought this through in detail so there may be holes. I'd actually be quite interested in your feedback on this approach given your experience in the area.
Thanks
Scott
For anyone interested in how I ended up dealing with this... (slightly Java-specific)
I ended up using an EhCache on each servlet instance. I used the UUID as a key, and a Java AtomicInteger as the value. Periodically a thread iterates through the cache and pushes rows to a simpledb temp stats domain, as well as writing a row with the key to an invalidation domain (which fails silently if the key already exists). The thread also decrements the counter with the previous value, ensuring that we don't miss any hits while it was updating. A separate thread pings the simpledb invalidation domain, and rolls up the stats in the temporary domains (there are multiple rows to each key, since we're using ec2 instances), pushing it to the actual stats domain.
I've done a little load testing, and it seems to scale well. Locally I was able to handle about 500 hits/second before the load tester broke (not the servlets - hah), so if anything I think running on ec2 should only improve performance.
Answer to feynmansbastard:
If you want to store a huge number of events, I suggest you use a distributed commit log system such as Kafka or AWS Kinesis. They let you consume a stream of events cheaply and simply (Kinesis's pricing is $25 per month for 1K events per second): you just need to implement a consumer (in any language) which bulk-reads all events since the previous checkpoint, aggregates counters in memory, then flushes the data into permanent storage (DynamoDB or MySQL) and commits the checkpoint.
Events can be logged simply using the nginx log and transferred to Kafka/Kinesis using fluentd. This is a very cheap, performant and simple solution.
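The consumer loop described above is simple enough to sketch. Here the stream is a plain Python list and the checkpoint an integer, standing in for a Kinesis shard iterator and a committed sequence number; the permanent store stands in for DynamoDB or MySQL:

```python
# Sketch of a checkpointing consumer: bulk-read since last checkpoint,
# aggregate in memory, flush, then commit the new checkpoint.

from collections import Counter

stream = [{'key': 'a'}, {'key': 'a'}, {'key': 'b'}, {'key': 'a'}]
checkpoint = 0                 # index of the next unread event
permanent_store = Counter()    # stands in for DynamoDB/MySQL counters

def consume_once():
    global checkpoint
    events = stream[checkpoint:]                 # bulk read since checkpoint
    counts = Counter(e['key'] for e in events)   # aggregate in memory
    permanent_store.update(counts)               # flush to permanent storage
    checkpoint += len(events)                    # commit the checkpoint

consume_once()
consume_once()   # no new events: a harmless no-op
```

Committing the checkpoint only after the flush means a crash in between replays events rather than losing them, so the permanent-storage update should be idempotent or additive per checkpoint range.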
I also had similar needs/challenges.
I looked at using Google Analytics and count.ly. The latter seemed too expensive to be worth it (plus they have a somewhat confusing definition of sessions). I would have loved to use GA, but I spent two days using their libraries and some 3rd-party ones (GaDotNet and one other, maybe from CodeProject). Unfortunately, I could only ever see counters post in GA's realtime section, never in the normal dashboards, even when the API reported success. We were probably doing something wrong, but we exceeded our time budget for GA.
We already had an existing SimpleDB counter that updated using conditional updates, as mentioned by a previous commenter. This works well, but suffers when there is contention and concurrency, where counts are missed (for example, our most-updated counter lost several million counts over a period of 3 months, versus a backup system).
We implemented a newer solution which is somewhat similar to the answer to this question, except much simpler.
We just sharded/partitioned the counters. When you create a counter, you specify the number of shards, which is a function of how many simultaneous updates you expect. This creates a number of sub-counters, each of which has the shard count stored with it as an attribute:
COUNTER (w/5shards) creates :
shard0 { numshards = 5 } (informational only)
shard1 { count = 0, numshards = 5, timestamp = 0 }
shard2 { count = 0, numshards = 5, timestamp = 0 }
shard3 { count = 0, numshards = 5, timestamp = 0 }
shard4 { count = 0, numshards = 5, timestamp = 0 }
shard5 { count = 0, numshards = 5, timestamp = 0 }
Sharded Writes
Knowing the shard count, just randomly pick a shard and try to write to it conditionally. If it fails because of contention, choose another shard and retry.
If you don't know the shard count, get it from the root shard, which is present regardless of how many shards exist. Because the scheme supports multiple writes per counter, it lessens the contention issue to whatever your needs are.
Sharded Reads
If you know the shard count, read every shard and sum them.
If you don't know the shard count, get it from the root shard and then read all and sum.
Because of slow update propagation, you can still miss counts when reading, but they should get picked up later. This is sufficient for our needs, although if you wanted more control over this, you could ensure, when reading, that the last timestamp was as you expect, and retry.
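The whole sharded scheme can be sketched in a few lines of Python. The conditional write is simplified away here (a real implementation would retry a failed conditional put on another shard, as described under Sharded Writes), but the write/read pattern matches what's above:

```python
# Sketch of a sharded counter: writes pick a random shard; reads sum all shards.

import random

NUM_SHARDS = 5
shards = {i: {'count': 0, 'numshards': NUM_SHARDS}
          for i in range(1, NUM_SHARDS + 1)}

def increment(amount=1):
    # Pick a shard at random; a real implementation would retry a failed
    # conditional put by choosing a different shard.
    shard = shards[random.randint(1, NUM_SHARDS)]
    shard['count'] += amount

def read_total():
    # Read every shard and sum them.
    return sum(s['count'] for s in shards.values())

for _ in range(100):
    increment()
total = read_total()  # 100, spread across the 5 shards
```

Spreading writes over N shards divides the per-item contention by roughly N, at the cost of N reads (or one query) to get the total.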