Azure Stream Analytics with Event Hub input stream position - azure-eventhub

Setup
I use Azure Stream Analytics to stream data into an Azure data warehouse staging table.
The input source of the job is an Event Hub stream.
I notice that when I update the job, the input event backlog goes up massively after the restart.
It looks like the job starts processing the complete Event Hub queue again from the beginning.
Questions
How is stream position management organised in Stream Analytics?
Is it possible to define a stream position where the job starts (for example, only events queued after a specific point in time)?
So far done
I noticed a similar question here on StackOverflow.
It mentions a variable named "eventStartTime".
But since I use an "asaproj" project within Visual Studio to create, update and deploy the job, I don't know where to set this before deploying.

When you update the job without stopping it, it keeps the previous "Job output start time" setting, so it is possible for the job to start processing the data from the beginning.
You can stop the job first, then choose the "Job output start time" before you start the job again.
You can refer to this document https://learn.microsoft.com/en-us/azure/stream-analytics/start-job for detailed information about each mode. For your scenario, the "When last stopped" mode may be the one you need; it will not process data from the beginning of the Event Hub queue.
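If you script your deployment around the asaproj project, a rough sketch of starting the job with an explicit output start mode using the azure-mgmt-streamanalytics Python SDK could look like the following. Treat this as an assumption about the management SDK (the begin_start call, StartStreamingJobParameters model and the "LastOutputEventTime" value are not confirmed by the answer above); resource group, job and subscription names are placeholders.

# Sketch only: assumes azure-identity and azure-mgmt-streamanalytics are installed
from azure.identity import DefaultAzureCredential
from azure.mgmt.streamanalytics import StreamAnalyticsManagementClient
from azure.mgmt.streamanalytics.models import StartStreamingJobParameters

client = StreamAnalyticsManagementClient(DefaultAzureCredential(), "<subscription-id>")

# "LastOutputEventTime" corresponds to the "When last stopped" option in the portal
poller = client.streaming_jobs.begin_start(
    "my-resource-group",   # placeholder
    "my-streaming-job",    # placeholder
    StartStreamingJobParameters(output_start_mode="LastOutputEventTime"),
)
poller.result()  # wait until the job has started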

Related

How to break up the streams with the Docker awslogs driver?

I have an EC2 instance and a Docker container running on it. Currently this container uses the awslogs driver to push logs to CloudWatch. If I go to the CloudWatch console, I see one very large log stream (with the container id as its name) which contains all logs of the last 16 days (since I created the container). It almost seems like, if I keep this container running for 1 year, this log stream will hold all logs of that year. I am not quite sure what the maximum size limit of a CloudWatch log stream is, but most likely there is one.
So my questions are:
How can I chunk this huge log stream, ideally by current date, e.g. something like {{.ContainerId}}{{.CurrentDate}}?
What is the maximum size limit of a CloudWatch log stream?
Is it a good practice to append onto a single huge log stream?
The following is the definition of a CloudWatch log stream as defined in the docs, here:
Log streams
A log stream is a sequence of log events that share the same source. More specifically, a log stream is generally intended to represent the sequence of events coming from the application instance or resource being monitored. For example, a log stream may be associated with an Apache access log on a specific host. When you no longer need a log stream, you can delete it using the aws logs delete-log-stream command.
Unfortunately what you want is not possible at the moment. I'm not sure what exactly your use case is, but you can filter the log streams by time, so separating them is not really necessary. See start-time and end-time in filter-log-events.
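For example, here is a rough boto3 sketch (log group and stream names are placeholders) of pulling events from that single big stream for a given time window:

import boto3
from datetime import datetime, timezone

logs = boto3.client("logs")

def to_millis(dt):
    # CloudWatch Logs expects epoch milliseconds
    return int(dt.replace(tzinfo=timezone.utc).timestamp() * 1000)

paginator = logs.get_paginator("filter_log_events")
pages = paginator.paginate(
    logGroupName="/my/log-group",              # placeholder
    logStreamNames=["<container-id-stream>"],  # placeholder stream name
    startTime=to_millis(datetime(2021, 6, 1)),
    endTime=to_millis(datetime(2021, 6, 2)),
)
for page in pages:
    for event in page["events"]:
        print(event["timestamp"], event["message"])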
You might want to define the following awslogs driver option to get a better stream name:
awslogs-stream-prefix (see docs)

How to process files serially in cloud function?

I have written a Cloud Storage trigger based Cloud Function. I have 10-15 files landing at 5 second intervals in a Cloud Storage bucket, and the function loads the data into a BigQuery table (truncate and load).
While there are 10 files in the bucket, I want the Cloud Function to process them sequentially, i.e. one file at a time, as all the files access the same table.
Currently the Cloud Function is triggered for multiple files at a time and the BigQuery operation fails because multiple files try to access the same table.
Is there any way to configure this in a Cloud Function?
Thanks in advance!
You can achieve this by using Pub/Sub and the max instances parameter on Cloud Functions.
First, use the notification capability of Google Cloud Storage and sink the events into a Pub/Sub topic.
Now you will receive a message every time an event occurs on the bucket. If you want to filter on file creation only (object finalize) you can apply a filter on the subscription. I wrote an article on this.
Then, create an HTTP function (an HTTP function is required if you want to apply a filter) with max instances set to 1. This way, only one function instance can execute at a time, so there is no concurrency.
Finally, create a Pub/Sub push subscription on the topic, with or without a filter, to call your function over HTTP.
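As a rough sketch (the function and helper names are placeholders, not taken from the question), the HTTP function receiving the Pub/Sub push message could look like this in Python:

import base64
import json

def handle_gcs_notification(request):
    # Entry point of an HTTP-triggered Cloud Function deployed with max instances = 1.
    # The push subscription wraps the GCS notification in a Pub/Sub envelope.
    envelope = request.get_json(silent=True)
    if not envelope or "message" not in envelope:
        return ("Bad Request: no Pub/Sub message received", 400)

    payload = json.loads(base64.b64decode(envelope["message"]["data"]).decode("utf-8"))
    bucket, name = payload["bucket"], payload["name"]

    # load_file_into_bigquery is a placeholder for your truncate-and-load logic
    load_file_into_bigquery(bucket, name)

    # A 2xx response acknowledges the message; an error causes a redelivery
    return ("", 204)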
EDIT
Thanks to your code, I understood what happens. In fact, BigQuery is a declarative system. When you perform a request or a load job, a job is created and it runs in the background.
In Python, you can explicitly wait for the end of the job, but with pandas I didn't find how.
I just found a Google Cloud page explaining how to migrate from pandas to the BigQuery client library. As you can see, there is a line at the end
# Wait for the load job to complete.
job.result()
that waits for the end of the job.
You did it well in the _insert_into_bigquery_dwh function, but it's not the case in the staging _insert_into_bigquery_staging one. This can lead to 2 issues:
The dwh function works on old data because the staging load isn't finished yet when you trigger this job.
If the staging load takes, let's say, 10 seconds and runs in the "background" (you don't explicitly wait for the end in your code) and the dwh load takes 1 second, the next file is processed at the end of the dwh function, even while the staging one continues to run in the background. And that leads to your issue.
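For illustration, a rough sketch of what the staging load could look like with the google-cloud-bigquery client instead of pandas, waiting for the job before returning (the table id, bucket and file format are placeholders, not taken from your code):

from google.cloud import bigquery

client = bigquery.Client()

def _insert_into_bigquery_staging(bucket: str, file_name: str):
    # WRITE_TRUNCATE mirrors the truncate-and-load behaviour described in the question
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        autodetect=True,
    )
    load_job = client.load_table_from_uri(
        f"gs://{bucket}/{file_name}",
        "my_project.my_dataset.staging_table",  # placeholder table id
        job_config=job_config,
    )
    # Block until the load job completes so the dwh step only sees fresh data
    load_job.result()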
The architecture you describe isn't the same as the one from the documentation you linked. Note that in the flow diagram and the code samples the storage event triggers the Cloud Function, which streams the data directly to the destination table. Since BigQuery allows multiple streaming insert jobs, several functions can execute at the same time without problems. In your use case the intermediate table, loaded with write-truncate for data cleaning, makes a big difference because each execution needs the previous one to finish, thus requiring a sequential processing approach.
I would like to point out that Pub/Sub doesn't let you configure the rate at which messages are delivered; if 10 messages arrive at the topic they will all be sent to the subscriber, even if they are processed one at a time. Limiting the function to one instance may lead to overhead for the above reason and could increase latency as well. That said, since the expected workload is 15-30 files a day, this may not be a big concern.
If you'd like to have parallel executions you may try creating a new table for each message and setting a short expiration deadline for it using the table.expires setter, so that multiple executions don't conflict with each other. Here is the related library reference. Otherwise the great answer from Guillaume would completely get the job done.
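For reference, a rough sketch of setting such an expiration with the Python BigQuery client (the table id is a placeholder; in recent client versions expires is a property you assign rather than a method you call):

import datetime
from google.cloud import bigquery

client = bigquery.Client()

# Create a throwaway table per message and let BigQuery delete it automatically
table = client.create_table(bigquery.Table("my_project.my_dataset.load_tmp_0001"))  # placeholder id

table.expires = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(hours=1)
client.update_table(table, ["expires"])  # persist only the expiration change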

Google Dataflow and Pubsub - can not achieve exactly-once delivery

I'm trying to achieve exactly-once delivery using Google Dataflow and Pub/Sub with Apache Beam SDK 2.6.0.
Use case is quite simple:
'Generator' dataflow job sends 1M messages to PubSub topic.
GenerateSequence
    .from(0)
    .to(1000000)
    .withRate(100000, Duration.standardSeconds(1L));
'Archive' dataflow job reads messages from PubSub subscription and saves to Google Cloud Storage.
pipeline
    .apply("Read events",
        PubsubIO.readMessagesWithAttributes()
            // this is to achieve exactly-once delivery
            .withIdAttribute(ATTRIBUTE_ID)
            .fromSubscription("subscription")
            .withTimestampAttribute(TIMESTAMP_ATTRIBUTE))
    .apply("Window events",
        Window.<Dto>into(FixedWindows.of(Duration.millis(options.getWindowDuration())))
            .triggering(Repeatedly.forever(AfterWatermark.pastEndOfWindow()))
            .withAllowedLateness(Duration.standardMinutes(15))
            .discardingFiredPanes())
    .apply("Events count metric", ParDo.of(new CountMessagesMetric()))
    .apply("Write files to archive",
        FileIO.<String, Dto>writeDynamic()
            .by(Dto::getDataSource).withDestinationCoder(StringUtf8Coder.of())
            .via(Contextful.of((msg, ctx) -> msg.getData(), Requirements.empty()), TextIO.sink())
            .to(archiveDir)
            .withTempDirectory(archiveDir)
            .withNumShards(options.getNumShards())
            .withNaming(dataSource ->
                new SyslogWindowedDataSourceFilenaming(dataSource, archiveDir, filenamePrefix, filenameSuffix)
            ));
I added 'withIdAttribute' to both PubsubIO.Write (the 'Generator' job) and PubsubIO.Read (the 'Archive' job) and expect that it will guarantee exactly-once semantics.
I would like to test the 'negative' scenario:
'Generator' dataflow job sends 1M messages to PubSub topic.
'Archive' dataflow job starts to work, but I stop it in the middle of processing by clicking 'Stop job' -> 'Drain'. Some portion of the messages has been processed and saved to Cloud Storage, let's say 400K messages.
I start the 'Archive' job again and expect that it will pick up the unprocessed messages (600K), and that eventually I will see exactly 1M messages saved to Storage.
What I actually got: all messages are delivered (at-least-once is achieved), but on top of that there are a lot of duplicates, something in the neighborhood of 30-50K per 1M messages.
Is there any solution to achieve exactly-once delivery?
Dataflow does not enable you to persist state across runs. If you use Java you can update a running pipeline in a way that does not cause it to lose the existing state, allowing you to deduplicate across pipeline releases.
If this doesn't work for you, you may want to archive messages in a way where they are keyed by ATTRIBUTE_ID, e.g., Spanner or GCS using this as the file name.
So, I've never done it myself, but reasoning about your problem this is how I would approach it...
My solution is a bit convoluted, but I failed to identify another way to achieve this without involving other external services. So, here goes nothing.
You could have your pipeline reading both from Pub/Sub and GCS and then combine them to de-duplicate the data. The tricky part here is that one would be a bounded PCollection (GCS) and the other an unbounded one (Pub/Sub). You can add timestamps to the bounded collection and then window the data. During this stage you could potentially drop GCS data older than ~15 minutes (the duration of the window in your previous implementation). These two steps (i.e. adding timestamps properly and dropping data that is probably old enough to not create duplicates) are by far the trickiest parts.
Once this has been solved, flatten the two PCollections together and then use a GroupByKey on an id that is common to both sets of data. This will yield a PCollection<KV<Long, Iterable<YOUR_DATUM_TYPE>>>. Then you can use an additional DoFn that drops all but the first element in the resulting Iterable and also removes the KV<> boxing. From there on you can simply continue processing the data as you normally would.
Finally, this additional work should be necessary only for the first Pub/Sub window when restarting the pipeline. After that you should re-assign the GCS PCollection to an empty PCollection so the group by key doesn't do too much additional work.
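Not tested, but here is a rough sketch of this idea in the Beam Python SDK (your pipeline is Java, so treat this purely as pseudocode for the shape of the graph; the parsing, id extraction, paths and window size are placeholders):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms.window import FixedWindows

def parse(raw):
    # Placeholder: turn a raw Pub/Sub payload or GCS line into your Dto equivalent
    return {"id": raw[:36], "data": raw}

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    from_pubsub = (
        p
        | "ReadPubSub" >> beam.io.ReadFromPubSub(subscription="projects/p/subscriptions/s")  # placeholder
        | "DecodePubSub" >> beam.Map(lambda b: parse(b.decode("utf-8")))
    )
    from_gcs = (
        p
        | "ReadArchive" >> beam.io.ReadFromText("gs://archive-bucket/prefix*")  # placeholder
        | "DecodeGcs" >> beam.Map(parse)
    )
    deduped = (
        (from_pubsub, from_gcs)
        | "Flatten" >> beam.Flatten()
        | "Window" >> beam.WindowInto(FixedWindows(15 * 60))
        | "KeyById" >> beam.Map(lambda rec: (rec["id"], rec))
        | "GroupById" >> beam.GroupByKey()
        | "KeepFirst" >> beam.Map(lambda kv: next(iter(kv[1])))
    )
    # ...continue with the existing windowing / FileIO write logic on `deduped`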
Let me know what you think and if this could work. Also, if you decide to pursue this strategy, please post your mileage :).
In the meantime, Pub/Sub has gained support for exactly-once delivery.
It is currently in the pre-GA launch state, so unfortunately it is not ready for production use yet.

Where are Azure WebJob QueueTrigger-ed Invocation Logs Stored?

I'm trying to find the queue messages processed in a QueueTrigger webjob. The problem is I didn't save these messages anywhere after processing and now I need them. I know they're available in the SCM WebJobs Dashboard at
https://{sitename}.scm.azurewebsites.net/azurejobs/#/functions/invocations/{invocation-id}
...if I know the {invocation-id}. I have a couple hundred processed messages that I'm trying to retrieve for a specific date range so going page-by-page in a web browser isn't practical.
Does anyone know where these logs are stored that SCM is displaying? I've looked in the azure-jobs-host-output and azure-webjobs-dashboard and can't find the messages anywhere. I've also looked in \data\jobs\continuous\{webjob}\job_log.txt, but this appears to only be the Console.output of a job and not the triggering CloudQueueMessage data which was passed to the webjob function.
Per my research, the detailed invocation logs are stored as blobs under azure-webjobs-dashboard\functions\instances.
Moreover, the list records for the invocation log are stored under azure-webjobs-dashboard\functions\recent\flat.
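If you want to pull those records out programmatically instead of paging through the dashboard, here is a rough sketch with the azure-storage-blob Python package (the connection string is a placeholder; point it at the storage account behind AzureWebJobsDashboard):

from azure.storage.blob import BlobServiceClient

# Connect to the storage account the WebJobs dashboard writes to
service = BlobServiceClient.from_connection_string("<AzureWebJobsDashboard connection string>")
container = service.get_container_client("azure-webjobs-dashboard")

# Each invocation is stored as a small blob under functions/instances/
for blob in container.list_blobs(name_starts_with="functions/instances/"):
    print(blob.name, blob.last_modified)
    payload = container.download_blob(blob.name).readall()  # invocation details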

Resume reading from kinesis after a KCL consumer outage [duplicate]

I can't find in the official AWS Kinesis documentation any explicit reference to the relationship between TRIM_HORIZON and the checkpoint, or between LATEST and the checkpoint.
Can you confirm my theory:
TRIM_HORIZON - if the application name is new, I will read all the records available in the stream. Otherwise, if the application name was already used, I will read from my last checkpoint.
LATEST - if the application name is new, I will read only the records added to the stream after I subscribed. Otherwise, if the application name was already used, I will read messages from my last checkpoint.
The difference between TRIM_HORIZON and LATEST matters only when the application name is new.
AT_TIMESTAMP -- from a specific timestamp
TRIM_HORIZON -- all the available messages in the Kinesis stream from the beginning (same as earliest in Kafka)
LATEST -- from the latest messages, i.e. the current message that just came into Kinesis/Kafka and all the incoming messages from that time onwards
From GetShardIterator documentation (which lines up with my experience using Kinesis):
In the request, you can specify the shard iterator type AT_TIMESTAMP to read records from an arbitrary point in time, TRIM_HORIZON to cause ShardIterator to point to the last untrimmed record in the shard in the system (the oldest data record in the shard), or LATEST so that you always read the most recent data in the shard.
Basically, the difference is whether you want to start from the oldest record (TRIM_HORIZON), or from "right now" (LATEST - skipping data between latest checkpoint and now).
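Outside the KCL (which tracks position in its lease table), the raw API makes the distinction easy to see; here is a rough boto3 sketch with a placeholder stream name:

import boto3

kinesis = boto3.client("kinesis")
stream_name = "my-stream"  # placeholder

shard_id = kinesis.describe_stream(StreamName=stream_name)["StreamDescription"]["Shards"][0]["ShardId"]

# TRIM_HORIZON: iterator points at the oldest untrimmed record in the shard
oldest = kinesis.get_shard_iterator(
    StreamName=stream_name, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]

# LATEST: iterator points just after the most recent record, i.e. "from now on"
newest = kinesis.get_shard_iterator(
    StreamName=stream_name, ShardId=shard_id, ShardIteratorType="LATEST"
)["ShardIterator"]

records = kinesis.get_records(ShardIterator=oldest, Limit=100)["Records"]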
The question clearly asks how these options relate to the checkpoint. However, none of the existing answers addresses the checkpoint at all.
An authoritative answer to this question by Justin Pfifer appears in a GitHub issue here.
The most relevant portion is
The KCL will always use the value in the lease table if it's present. It's important to remember that Kinesis itself doesn't track the position of consumers. Tracking is provided by the lease table. Leases in the KCL serve double duty. They provide both mutual exclusion and position tracking. So for mutual exclusion a lease needs to be created, and to satisfy the position tracking an initial value must be selected.
(Emphasis added by me.)
I think choosing between the two is a trade-off between whether you want to start from the most recent data or from the oldest data that hasn't been processed from Kinesis yet.
Imagine a scenario where there is a bug in your lambda function: it throws an exception on the first record it gets and returns an error back to Kinesis, because of which none of the records in your Kinesis stream get processed and they remain there for the 1 day retention period. After you have fixed the bug and deployed your lambda, it will start getting all those messages from the buffer that Kinesis has been holding. Now your downstream service will have to process old data instead of the most recent data. This could add unwanted latency to your application if you choose TRIM_HORIZON.
But if you use LATEST, you can ignore all those previously stuck messages and have your lambda start processing from new events/messages, thus improving the latency your system provides.
So you will have to decide which is more important for your customers: is losing a few data points fine (and what is your tolerance limit), or do you always need accurate results, for example when calculating a sum or a counter?