Where are Azure WebJob QueueTrigger-ed Invocation Logs Stored? - azure-webjobs

I'm trying to find the queue messages processed in a QueueTrigger webjob. The problem is I didn't save these messages anywhere after processing and now I need them. I know they're available in the SCM WebJobs Dashboard at
https://{sitename}.scm.azurewebsites.net/azurejobs/#/functions/invocations/{invocation-id}
...if I know the {invocation-id}. I have a couple hundred processed messages that I'm trying to retrieve for a specific date range, so going page-by-page in a web browser isn't practical.
Does anyone know where the logs that SCM displays are stored? I've looked in the azure-jobs-host-output and azure-webjobs-dashboard containers and can't find the messages anywhere. I've also looked in \data\jobs\continuous\{webjob}\job_log.txt, but that appears to contain only the console output of the job, not the triggering CloudQueueMessage data that was passed to the WebJob function.

Per my research, the detailed invocation logs are stored as blobs under azure-webjobs-dashboard\functions\instances, and the list records backing the Invocation Log view are stored under azure-webjobs-dashboard\functions\recent\flat.
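If you need to pull a few hundred of them out in bulk, something along these lines may help (Python with the azure-storage-blob package; it assumes each blob under functions/instances is a JSON record, so inspect one blob first and adjust the parsing and date filtering to what you actually find):

import json
from azure.storage.blob import ContainerClient

conn_str = "<your AzureWebJobsDashboard connection string>"
container = ContainerClient.from_connection_string(conn_str, "azure-webjobs-dashboard")

# Each invocation has a detail blob under functions/instances/{invocation-id}.
for blob in container.list_blobs(name_starts_with="functions/instances/"):
    data = container.download_blob(blob.name).readall()
    record = json.loads(data)  # assumption: the blob body is JSON
    # Filter on your date range here, then pull out the queue message payload.
    print(blob.name, record)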

Related

Dynamically Create Cloud Run or Function determined by attribute in pub/sub

I am trying to get a Cloud Run service or Cloud Function to start and pull out the messages that match its defined ID. For example, if a message with attribute ID 1 is put into the topic, the Cloud Run instance with ID 1 will take it out; it's important that all messages with attribute 1 go to the same instance.
I understand I could use filters on the subscriptions, but I would like to be able to easily change the number of possible IDs, e.g. if I only put messages in the topic with IDs ranging between 0 and 4, then only five instances would be started.
How would I go about creating something like this? Does Pub/Sub support this sort of functionality?
I know I could create X topics and then put each message into its own topic, but that seems like an inefficient way of doing this when the attribute system exists.
This was not possible. Instead, I had to wait for all the data for the desired function to be ready before starting the function; I could not have it continually poll and pick out the correct data.
The best approach was therefore to put the data into Firestore and then, when all the data was ready for the next layer to process, put a message into Pub/Sub containing a message ID. This message ID determines which data the function will process.
The function then queries Firestore for documents with a property that includes the message ID it was given.
I could not find any other workable approach that would give me the result I desired.
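A minimal sketch of that pattern (a Python Cloud Function with a Pub/Sub trigger; the collection name "staging" and the field "message_id" are placeholders, not from the original setup):

import base64
from google.cloud import firestore

db = firestore.Client()

def handle_message(event, context):
    # The Pub/Sub message body carries only the ID of the batch to process.
    message_id = base64.b64decode(event["data"]).decode("utf-8")

    # Pull every staged document tagged with that ID and process it.
    docs = db.collection("staging").where("message_id", "==", message_id).stream()
    for doc in docs:
        process(doc.to_dict())

def process(item):
    print(item)  # your processing logic goes here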

How to process files serially in cloud function?

I have written a Cloud Storage trigger based Cloud Function. I have 10-15 files landing at 5 second intervals in a Cloud Storage bucket, and the function loads the data into a BigQuery table (truncate and load).
While there are 10 files in the bucket I want the Cloud Function to process them sequentially, i.e. one file at a time, as all the files access the same table.
Currently the Cloud Function gets triggered for multiple files at a time and the BigQuery operation fails because multiple files try to access the same table.
Is there any way to configure this in Cloud Functions?
Thanks in Advance!
You can achieve this by using Pub/Sub and the max instances parameter on Cloud Functions.
Firstly, use the notification capability of Google Cloud Storage and sink the events into a Pub/Sub topic.
Now you will receive a message every time an event occurs on the bucket. If you want to filter on file creation only (OBJECT_FINALIZE) you can apply a filter on the subscription. I wrote an article on this.
Then, create an HTTP function (an HTTP function is required if you want to apply a filter) with max instances set to 1. This way only one function instance can run at a time, so no concurrency!
Finally, create a Pub/Sub subscription on the topic, with or without a filter, to call your function over HTTP.
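For illustration, the receiving HTTP function might look roughly like this (Python, deployed with max instances set to 1; it assumes the standard Pub/Sub push envelope and the standard Cloud Storage notification payload, and load_into_bigquery stands in for your existing load logic):

import base64
import json

def handle_gcs_event(request):
    # Pub/Sub push wraps the message in an envelope: {"message": {"data": ...}}.
    envelope = request.get_json()
    payload = base64.b64decode(envelope["message"]["data"]).decode("utf-8")
    event = json.loads(payload)  # the GCS object metadata, as JSON

    load_into_bigquery(event["bucket"], event["name"])
    return ("", 204)

def load_into_bigquery(bucket, name):
    print(f"loading gs://{bucket}/{name}")  # your existing load logic goes here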
EDIT
Thanks to your code, I understood what happens. BigQuery is a declarative system: when you perform a query or a load job, a job is created and runs in the background.
In Python you can explicitly wait for the end of the job, but with pandas I didn't find how!
I just found a Google Cloud page explaining how to migrate from pandas to the BigQuery client library. As you can see, there is a line at the end:
# Wait for the load job to complete.
job.result()
that waits for the end of the job.
You did this correctly in the _insert_into_bigquery_dwh function, but not in the staging _insert_into_bigquery_staging one. This can lead to two issues:
The dwh function works on old data because the staging load isn't finished yet when you trigger this job.
If the staging load takes, say, 10 seconds and runs in the "background" (you don't explicitly wait for it to finish in your code) and the dwh job takes 1 second, the next file is processed at the end of the dwh function even though the staging load is still running in the background. That is what leads to your issue.
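For reference, a minimal sketch of a staging load that blocks until the job finishes (using the BigQuery client library rather than pandas; the function and table names here are illustrative, not taken from the original code):

from google.cloud import bigquery

client = bigquery.Client()

def insert_into_bigquery_staging(dataframe):
    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    job = client.load_table_from_dataframe(
        dataframe, "my_dataset.staging_table", job_config=job_config
    )
    # Block until the load is done, so the dwh step never reads stale data.
    job.result()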
The architecture you describe isn't the same as the one in the documentation you linked. Note that in the flow diagram and the code samples the storage events trigger the Cloud Function, which streams the data directly to the destination table. Since BigQuery allows multiple concurrent streaming inserts, several function instances can execute at the same time without problems. In your use case the intermediate table, loaded with write-truncate for data cleaning, makes a big difference, because each execution needs the previous one to finish, thus requiring a sequential processing approach.
I would like to point out that Pub/Sub doesn't let you configure the rate at which messages are sent: if 10 messages arrive at the topic, they will all be sent to the subscriber, even if processed one at a time. Limiting the function to one instance may lead to overhead for the above reason and could increase latency as well. That said, since the expected workload is 15-30 files a day, this may not be a big concern.
If you'd like to have parallel executions you may try creating a new table for each message and setting a short expiration on it via the table.expires property, so that multiple executions don't conflict with each other. Here is the related library reference. Otherwise the great answer from Guillaume would completely get the job done.
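A sketch of that "one short-lived table per message" idea with the Python BigQuery client (project, dataset and schema are placeholders):

import datetime
from google.cloud import bigquery

client = bigquery.Client()

def create_temp_table(message_id):
    table = bigquery.Table(
        f"my_project.my_dataset.staging_{message_id}",
        schema=[bigquery.SchemaField("payload", "STRING")],
    )
    # BigQuery drops the table automatically once it expires.
    table.expires = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(hours=1)
    return client.create_table(table)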

Dataflow Pipeline - “Processing stuck in step <STEP_NAME> for at least <TIME> without outputting or completing in state finish…”

Since I'm not allowed to ask my question in the same thread where another person has the same problem (but isn't using a template), I'm creating this new thread.
The problem: I'm creating a Dataflow job from a template in GCP to ingest data from Pub/Sub into BigQuery. This works fine until the job executes; then the job gets "stuck" and does not write anything to BigQuery.
I can't do much because I can't choose the Beam version in the template. This is the error:
Processing stuck in step WriteSuccessfulRecords/StreamingInserts/StreamingWriteTables/StreamingWrite for at least 01h00m00s without outputting or completing in state finish
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
at java.util.concurrent.FutureTask.get(FutureTask.java:191)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:803)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:867)
at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.flushRows(StreamingWriteFn.java:140)
at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.finishBundle(StreamingWriteFn.java:112)
at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn$DoFnInvoker.invokeFinishBundle(Unknown Source)
Any ideas how to get this to work?
The issue is coming from the step WriteSuccessfulRecords/StreamingInserts/StreamingWriteTables/StreamingWrite, which suggests a problem while writing the data.
Your error can be replicated by (using either Pub/Sub Subscription to BigQuery or Pub/Sub Topic to BigQuery):
Configuring a template with a table that doesn't exist.
Starting the template with a correct table and deleting it during the job execution.
In both cases the "stuck" message happens because the data is being read from Pub/Sub but the pipeline is waiting for the table to be available before inserting the data. The error is reported every 5 minutes and it gets resolved when the table is created.
To verify the table configured in your template, see the property outputTableSpec in the PipelineOptions in the Dataflow UI.
I had the same issue before. The problem was that I used NestedValueProviders to evaluate the Pub/Sub topic/subscription, and this is not supported for templated pipelines.
I was getting the same error and the reason was that I had created an empty BigQuery table without specifying a schema. Make sure to create a BigQuery table with a schema before loading data via Dataflow.
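For example, with the Python BigQuery client (table and field names are just placeholders for whatever your outputTableSpec points at):

from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("event_id", "STRING"),
    bigquery.SchemaField("event_ts", "TIMESTAMP"),
]
table = bigquery.Table("my_project.my_dataset.my_output_table", schema=schema)
client.create_table(table, exists_ok=True)  # no-op if the table already exists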

Google Dataflow and Pubsub - can not achieve exactly-once delivery

I'm trying to achieve exactly-once delivery using Google Dataflow and PubSub using Apache Beam SDK 2.6.0.
Use case is quite simple:
'Generator' dataflow job sends 1M messages to PubSub topic.
GenerateSequence
.from(0)
.to(1000000)
.withRate(100000, Duration.standardSeconds(1L));
'Archive' dataflow job reads messages from PubSub subscription and saves to Google Cloud Storage.
pipeline
.apply("Read events",
PubsubIO.readMessagesWithAttributes()
// this is to achieve exactly-once delivery
.withIdAttribute(ATTRIBUTE_ID)
.fromSubscription("subscription")
.withTimestampAttribute(TIMESTAMP_ATTRIBUTE))
.apply("Window events",
Window.<Dto>into(FixedWindows.of(Duration.millis(options.getWindowDuration())))
.triggering(Repeatedly.forever(AfterWatermark.pastEndOfWindow()))
.withAllowedLateness(Duration.standardMinutes(15))
.discardingFiredPanes())
.apply("Events count metric", ParDo.of(new CountMessagesMetric()))
.apply("Write files to archive",
FileIO.<String, Dto>writeDynamic()
.by(Dto::getDataSource).withDestinationCoder(StringUtf8Coder.of())
.via(Contextful.of((msg, ctx) -> msg.getData(), Requirements.empty()), TextIO.sink())
.to(archiveDir)
.withTempDirectory(archiveDir)
.withNumShards(options.getNumShards())
.withNaming(dataSource ->
new SyslogWindowedDataSourceFilenaming(dataSource, archiveDir, filenamePrefix, filenameSuffix)
));
I added 'withIdAttribute' to both PubsubIO.Write ('Generator' job) and PubsubIO.Read ('Archive' job) and expected that it would guarantee exactly-once semantics.
I would like to test the 'negative' scenario:
'Generator' dataflow job sends 1M messages to PubSub topic.
'Archive' dataflow job starts to work, but I stop it in the middle of processing by clicking 'Stop job' -> 'Drain'. Some portion of the messages has been processed and saved to Cloud Storage, let's say 400K messages.
I start the 'Archive' job again and expect that it will pick up the unprocessed messages (600K) and eventually I will see exactly 1M messages saved to Storage.
What I got in fact: all messages are delivered (at-least-once is achieved), but on top of that there are a lot of duplicates, something in the neighborhood of 30-50K per 1M messages.
Is there any solution to achieve exactly-once delivery?
Dataflow does not enable you to persist state across runs. If you use Java you can update a running pipeline in a way that does not cause it to lose the existing state, allowing you to deduplicate across pipeline releases.
If this doesn't work for you, you may want to archive messages in a way where they are keyed by ATTRIBUTE_ID, e.g., Spanner or GCS using this as the file name.
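A small sketch of the GCS variant of that idea (Python; the bucket name and object layout are placeholders): naming each archived object after its ATTRIBUTE_ID turns a redelivery into an idempotent overwrite rather than a duplicate.

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-archive-bucket")

def archive(message_id, payload):
    # One object per ATTRIBUTE_ID: a duplicate delivery just rewrites the same object.
    blob = bucket.blob(f"archive/{message_id}.json")
    blob.upload_from_string(payload)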
So, I've never done it myself, but reasoning about your problem this is how I would approach it...
My solution is a bit convoluted, but I failed to identify any other way to achieve this without involving other external services. So, here goes nothing.
You could have your pipeline read both from Pub/Sub and GCS and then combine them to de-duplicate the data. The tricky part here is that one would be a bounded PCollection (GCS) and the other an unbounded one (Pub/Sub). You can add timestamps to the bounded collection and then window the data. During this stage you could potentially drop GCS data older than ~15 minutes (the duration of the window in your previous implementation). These two steps (i.e. adding timestamps properly and dropping data that is probably old enough to not create duplicates) are by far the trickiest parts.
Once this has been solved, append (Flatten) the two PCollections and then use a GroupByKey on an id that is common to both sets of data. This will yield a PCollection<KV<Long, Iterable<YOUR_DATUM_TYPE>>>. Then you can use an additional DoFn that drops all but the first element in the resulting Iterable and also removes the KV<> boxing. From there on you can simply continue processing the data as you normally would.
Finally, this additional work should be necessary only for the first Pub/Sub window when restarting the pipeline. After that you should re-assign the GCS PCollection to an empty PCollection so the group-by-key doesn't do too much additional work.
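A rough sketch of that keep-first step with the Beam Python SDK (beam.Create stands in for the flattened Pub/Sub + GCS collection, and the (id, record) element shape is an assumption):

import apache_beam as beam

def keep_first(element):
    # element is (id, iterable_of_records): keep one record and drop the KV boxing.
    _, records = element
    return next(iter(records))

with beam.Pipeline() as p:
    # Stand-in for the flattened Pub/Sub + GCS collection of (id, record) pairs.
    merged = p | beam.Create([(1, "event-a"), (1, "event-a-redelivered"), (2, "event-b")])
    (merged
     | "GroupById" >> beam.GroupByKey()
     | "KeepFirst" >> beam.Map(keep_first)
     | beam.Map(print))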
Let me know what you think and if this could work. Also, if you decide to pursue this strategy, please post your mileage :).
In the meantime, Pub/Sub has added support for exactly-once delivery.
It is currently in a pre-GA launch state, so unfortunately it is not ready for production use yet.

Filter AWS Cloudwatch Lambda's Log

I have a Lambda function and its logs in Cloudwatch (Log group and Log Stream). Is it possible to filter (in Cloudwatch Management Console) all logs that contain "error"? For example logs containing "Process exited before completing request".
In Log Groups there is a button "Search Events". You must click on it first.
Then it changes to "Filter Streams".
Now you just type your filter and select the start date and time.
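If the console search is too slow across a lot of log streams, the same filter can be run through the API, for example with boto3 (the log group name is a placeholder):

import boto3

logs = boto3.client("logs")

paginator = logs.get_paginator("filter_log_events")
pages = paginator.paginate(
    logGroupName="/aws/lambda/my-function",
    filterPattern='"Process exited before completing request"',
)
for page in pages:
    for event in page["events"]:
        print(event["timestamp"], event["message"])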
So this is kind of a side issue, but it was relevant for us. (I posted this to another answer on StackOverflow but thought it would be relevant to this conversation too)
We've noticed that tailing and searching logs gets really slow after a log group has a lot of Log Streams in it, like when an AWS Lambda Function has had a lot of invocations. This is because "tail" type utilities and searching need to connect to each log stream to run. Log Events get expired and deleted due to the policy you set on the Log Group itself, but the Log Streams never get cleaned up. I made a few little utility scripts to help with that:
https://github.com/four43/aws-cloudwatch-log-clean
Hopefully that saves you some agony waiting for those logs to be searched.
You can also use CloudWatch Logs Insights (https://aws.amazon.com/about-aws/whats-new/2018/11/announcing-amazon-cloudwatch-logs-insights-fast-interactive-log-analytics/), which is an AWS extension to CloudWatch Logs that gives you a pretty powerful query and analytics tool. However, it can be slow; some of my queries take up to a minute. That's okay if you really need the data.
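The same kind of "find the errors" query can also be run programmatically; a boto3 sketch (the log group name is a placeholder, and Insights queries run asynchronously, so you have to poll for the results):

import time
import boto3

logs = boto3.client("logs")

query = logs.start_query(
    logGroupName="/aws/lambda/my-function",
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString="fields @timestamp, @message | filter @message like /error/ | sort @timestamp desc | limit 50",
)

while True:
    result = logs.get_query_results(queryId=query["queryId"])
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result["results"]:
    print(row)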
You could also use a tool I created called SenseLogs. It downloads CloudWatch data to your browser, where you can do the kind of queries you're asking about. You can either do a full-text search for "error" or, if your log data is structured (JSON), use a JavaScript-like expression language to filter by field, e.g.:
error == 'critical'
Posting an update as CloudWatch has changed since 2016:
In the Log Groups view there is a "Search all" button for a full-text search.
Then just type your search term.