Avoid webjob waiting when ingesting a small batch of data into Azure Data Explorer - azure-webjobs

I have a webjob that receives site click events from Azure Event Hubs and then ingests those events into ADX (Azure Data Explorer).
public static async Task Run([EventHubTrigger] EventData[] events, ILogger logger)
{
    // Process events
    try
    {
        var ingestResult = await _adxIngester.IngestAsync(events);
        if (!ingestResult)
        {
            AppInsightLogError();
            logger.LogError();
        }
    }
    catch (Exception ex)
    {
        AppInsightLogError();
        logger.LogError();
    }
}
I've used queued ingestion and turned off FlushImmediately when ingesting into ADX, which enables batch ingestion. When the events do not meet the default IngestionBatching policy of 1000 events / 1 GB of data, ADX waits 5 minutes before it returns a Success status, which makes Run wait for that amount of time as well.
public async Task<bool> IngestAsync(...)
{
    IKustoQueuedIngestClient client = KustoIngestFactory.CreateQueuedIngestClient(kustoConnectionString);
    var kustoIngestionProperties = new KustoQueuedIngestionProperties(databaseName: "myDB", tableName: "events")
    {
        ReportLevel = IngestionReportLevel.FailuresOnly,
        ReportMethod = IngestionReportMethod.Table,
        FlushImmediately = false
    };
    var streamIdentifier = Guid.NewGuid();
    var clientResult = await client.IngestFromStreamAsync(...);
    // Poll the ingestion status until the ingestion completes or fails
    var ingestionStatus = clientResult.GetIngestionStatusBySourceId(streamIdentifier);
    while (ingestionStatus.Status == Status.Pending)
    {
        await Task.Delay(TimeSpan.FromSeconds(15));
        ingestionStatus = clientResult.GetIngestionStatusBySourceId(streamIdentifier);
    }
    if (ingestionStatus.Status == Status.Failed)
    {
        return false;
    }
    return true;
}
Since I don't want my webjob to wait that long when there are not many events coming in, or when it's simply QA at work, I made the following changes:
Don't await IngestAsync, thus making Run a synchronous method.
Add an Action onError parameter to IngestAsync and call it when the ingest task fails. Call AppInsightLogError() and logger.LogError() inside onError, instead of returning false.
Replace IngestFromStreamAsync with IngestFromStream.
Basically, I want to ensure the events reach the Azure queue, and that any exception is thrown, before I poll for the ingestion status; then I exit the Run method without waiting for the status polling, and if anything fails it will be logged.
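Roughly, the restructured code looks like the sketch below (the elided ingest arguments and helper names such as AppInsightLogError are placeholders carried over from the snippets above):

public static void Run([EventHubTrigger] EventData[] events, ILogger logger)
{
    // Fire-and-forget: Run returns as soon as IngestAsync has been kicked off.
    _ = _adxIngester.IngestAsync(events, onError: () =>
    {
        AppInsightLogError();
        logger.LogError();
    });
}

public async Task IngestAsync(EventData[] events, Action onError)
{
    try
    {
        // Returns once the blob is uploaded and the message is posted to the ingestion queue.
        var clientResult = client.IngestFromStream(..., kustoIngestionProperties);
        // The status polling loop from the original code now runs after Run has already exited.
    }
    catch (Exception)
    {
        onError();
    }
}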
My question is:
Is it good practice to avoid having the webjob wait for minutes? If not, why?
If yes, is my solution good enough for this problem? Otherwise, how should I do it?

If you are ingesting small batches of data and wish to cut down on the ingestion batching times, please read the following article: https://learn.microsoft.com/en-us/azure/kusto/concepts/batchingpolicy
The IngestionBatching policy allows you to control the batching limits per database or table.
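For example, a sketch of tightening the policy on the target table using the Kusto .NET admin client (the cluster URI, the authentication method, and the specific limits below are illustrative assumptions):

using Kusto.Data;
using Kusto.Data.Net.Client;

var kcsb = new KustoConnectionStringBuilder("https://mycluster.westus.kusto.windows.net")
    .WithAadUserPromptAuthentication(); // any supported authentication method works here

using (var adminClient = KustoClientFactory.CreateCslAdminProvider(kcsb))
{
    // Seal a batch for the 'events' table after 30 seconds, 500 items, or 1 GB,
    // whichever comes first, instead of waiting for the defaults.
    var command = @".alter table events policy ingestionbatching
        '{""MaximumBatchingTimeSpan"":""00:00:30"",""MaximumNumberOfItems"":500,""MaximumRawDataSizeMB"":1024}'";
    adminClient.ExecuteControlCommand("myDB", command);
}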

The ingestion is performed in a few phases. One phase is done on the client side, and one phase is done on the server side:
The ingest client code you’re using is going to take your stream and upload it to a blob, and then it will send a message to a queue.
Any exceptions thrown during that phase will indeed be propagated to your code, which is why you should use a try-catch block and log the error message in the catch block, as you suggested.
You can either use IngestFromStreamAsync with the await keyword, or use IngestFromStream. The first option is better if you’d like to release the worker thread and save resources. But choosing between those two doesn’t have anything to do with the polling. The polling is relevant to the second phase.
Kusto’s DataManagement component is constantly listening to messages in the queue, so as soon as it gets to your new message, it will read it and see metadata about the new ingestion request, such as the blob URI where the data is stored and the Azure table where failures/progress should be updated.
That phase is done remotely by the server side, and you have an option to wait in your client code for each single ingestion and poll until the server completes the ingestion process. If there are any exceptions during that phase, then of course they won’t be propagated to your client code, but rather you’ll be able to examine the Azure table and see what happened.
You can also decide to defer that status examination, and have it done in some other task.
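For example (a sketch that reuses clientResult and streamIdentifier from the code above, with onError standing in for the AppInsightLogError/logger calls), the polling loop can be pushed onto a background task so the caller does not wait for it:

// Kick off the status check without awaiting it; the caller returns immediately.
_ = Task.Run(async () =>
{
    var status = clientResult.GetIngestionStatusBySourceId(streamIdentifier);
    while (status.Status == Status.Pending)
    {
        await Task.Delay(TimeSpan.FromSeconds(15));
        status = clientResult.GetIngestionStatusBySourceId(streamIdentifier);
    }
    if (status.Status == Status.Failed)
    {
        onError(); // e.g. AppInsightLogError() and logger.LogError()
    }
});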

IngestFromStreamAsync uploads your data to a blob and posts a message to the Data Management input queue. It will not wait for the aggregation time, and the final state you will get is Queued.
FlushImmediately defaults to false.
If there isn't any additional processing, consider using the Event Hub to Kusto connection.
[Edited] responding to comments:
The Queued state indicates the blob is pending ingestion. You can track the status with the .show ingestion failures command, metrics, and the ingestion logs; see the sketch below.
The Event Hub connection goes through queued ingestion by default. It will use streaming ingestion only if that is set as a policy on the database / table.
Some of the processing can be done in ADX, using an ingestion mapping and an update policy.
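For example, a sketch of reading the failure records with the same kind of admin client as in the earlier sketch (the column names used here are assumptions about the command's output):

// Reusing the 'adminClient' created in the earlier sketch.
using (var reader = adminClient.ExecuteControlCommand("myDB", ".show ingestion failures"))
{
    while (reader.Read())
    {
        Console.WriteLine($"{reader["Table"]}: {reader["Details"]}");
    }
}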

Related

How can I process a newly uploaded object in Google Cloud Storage exactly once?

I would like to receive files into a Google Cloud Storage bucket and have a Python job run exactly once for each file. I would like many such Python jobs to be running concurrently, in order to process many files in parallel, but each file should only be processed once.
I have considered the following:
Pub/Sub messages
Generate Pub/Sub messages for the OBJECT_FINALIZE event on the bucket. The issue here is that Pub/Sub may deliver messages more than once, so a pool of Python jobs listening to the same subscription may run more than one job for the same message, so I could either...
Use Dataflow to deduplicate messages, but in my non-streaming use case, Dataflow seems to be expensive overkill, and this answer seems to suggest it's not the right tool for the job.
or
Create a locking mechanism using a transactional database (say, PostgreSQL on Cloud SQL). Any job receiving a message can attempt to acquire a lock with the same name as the file, any job that fails to acquire a lock can terminate and not ACK the message, and any job with the lock can continue processing and label the lock as done to prevent any future acquisition of that lock.
I think 2 would work but it also feels over-engineered.
Polling
Instead of using Pub/Sub, have jobs poll for new files in the bucket.
This feels like it would simply replace Pub/Sub with a less robust solution that would still require a locking mechanism.
Eventarc
Use Eventarc to trigger a Cloud Run container holding my code. This seems similar to Pub/Sub, and simpler, but I can find no explanation of how Eventarc deals with things like retries, or whether it comes with any exactly-once guarantees.
Single controller spawning multiple workers
Create a central controller process that handles deduplication of file events (received either through Pub/Sub, polling, or Eventarc), then spawns worker jobs and allocates each file exactly once to a worker job.
I think this could also work but creates a single point of failure and potentially a throughput bottleneck.
You're on the right track, and yes, Pub/Sub push messages may be delivered more than once.
One simple technique to manage that is to rename the file as you start processing it. Renaming is an atomic transaction, so if it succeeds, you're good to process the file.
from google.cloud import storage

PROC_PRF = "processing"

# Sketch of the handler body; the function wrapper and its arguments are illustrative.
def handle_finalize(event, context):
    bucketName = ...  # get it from the message
    fileName = ...    # get it from the message
    # Renaming the file below triggers another google.storage.object.finalize event
    if PROC_PRF in fileName:
        print("Exiting due to rename event")
        # Ack the message and exit
        return
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucketName)
    blob = bucket.get_blob(fileName)
    try:
        newBlob = bucket.rename_blob(blob, new_name=fileName + '.' + PROC_PRF)
    except Exception:
        raise RuntimeError("Error: File rename from " + fileName + " failed, is this a duplicate function call?")
    # The rename worked - process the file & message

Event Hub: send events to random partitions but exactly one partition

I have an Event Hub publisher, but it is duplicating messages across random partitions multiple times. I want parallel publishing for the huge number of messages coming in; each message should go to a random but exactly one partition, from which the consumer gets the data.
How do I do that? This is causing the messages to be duplicated.
EventHubProducerClientOptions producerClientOptions = new EventHubProducerClientOptions
{
    RetryOptions = new EventHubsRetryOptions
    {
        Mode = EventHubsRetryMode.Exponential,
        MaximumRetries = 30,
        TryTimeout = TimeSpan.FromSeconds(5),
        Delay = TimeSpan.FromSeconds(10),
        MaximumDelay = TimeSpan.FromSeconds(15),
    }
};

using EventDataBatch eventBatch = await producerClient.CreateBatchAsync();

// Add events to the batch. An event is represented by a collection of bytes and metadata.
eventBatch.TryAdd(eventMessage);

string logInfo = $"[PUBLISHED - [{EventId}]] =======> {message}";
logger.LogInformation(logInfo);

// Use the producer client to send the batch of events to the event hub
await producerClient.SendAsync(eventBatch);
Your code sample is publishing your batch to the Event Hubs gateway, where events will be routed to a partition. For a successful publish operation, each event will be sent to one partition only.
"Successful" is the key in that phrase. You're configuring your retry policy with a TryTimeout of 5 seconds and allowing 30 retries. The duplication that you're seeing is most likely caused by your publish request timing out due to the very short interval, being successfully received by the service, but leaving the service unable to acknowledge success. This will cause the client to consider the operation a failure and retry.
By default, the TryTimeout interval is 60 seconds. I'm not sure why you've chosen to restrict the timeout to such a small value, but I'd strongly advise considering changes. Respectfully, unless you've done profiling and measuring to prove that you need to make changes, I'd advise using the default values for retries in their entirety.
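For example, a sketch of both options (connectionString and eventHubName are placeholders, and the code assumes the same Azure.Messaging.EventHubs client used in the question):

using Azure.Messaging.EventHubs;
using Azure.Messaging.EventHubs.Producer;

// Option 1: rely on the library defaults (exponential retries, 60-second TryTimeout).
await using var producerClient = new EventHubProducerClient(connectionString, eventHubName);

// Option 2: if profiling shows you must tune, keep TryTimeout generous rather than 5 seconds.
var options = new EventHubProducerClientOptions
{
    RetryOptions = new EventHubsRetryOptions
    {
        Mode = EventHubsRetryMode.Exponential,
        TryTimeout = TimeSpan.FromSeconds(60) // mirrors the default mentioned above
    }
};
await using var tunedClient = new EventHubProducerClient(connectionString, eventHubName, options);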

How to check if the topic-queue is empty and then terminate the subscriber?

In my business application I have to batch-process all the messages from a topic periodically because it is cheaper than processing them in a first-come-first-serve fashion. The current way I am planning to do it is have a cronjob that runs the subscriber every T hours. The problem that I am currently solving is how to terminate the subscriber once all the messages have been processed. I want to fire up the cronjob every T hours, let the subscriber consume all the messages in the topic-queue and terminate. From what I understand, there is no pub-sub Java API that tells me whether the topic-queue is empty or not. I have come up with the following 2 solutions:
Create a subscriber that pulls asynchronously. Sleep for t minutes while it consumes all the messages and then terminate it using subscriber.stopAsync().awaitTerminated(). With this approach, there is a possibility I might not consume all the messages before terminating the subscriber. A Google example is here.
Use Pub/Sub Cloud monitoring to find the value of the metric subscription/num_undelivered_messages. Then pull that many messages using the synchronous pull example provided by Google here. Then terminate the Subscriber.
Is there a better way to do this?
Thanks!
It might be worth considering whether or not Cloud Pub/Sub is the right technology to use for this case. If you want to do batch processing, you might be better off storing the data in Google Cloud Storage or in a database. Cloud Pub/Sub is really best for continuous pulling/processing of messages.
The two suggestions you have are trying to determine when there are no more messages to process. There isn't really a clean way to do this. Your first suggestion is possible, though keep in mind that while most messages will be delivered extremely quickly, there can be outliers that take longer to be sent to your subscriber. If it is critical that all outstanding messages be processed, then this approach may not work. However, if it is okay for messages to occasionally be processed the next time you start up your subscriber, then you could use this approach. It would be best to set up a timer based on the time since the last message was received, as guillaum blaquiere suggests, though I would use a timeout on the order of 1 minute and not 100 ms.
Your second suggestion of monitoring the number of undelivered messages and then sending a pull request to retrieve that many messages would not be as viable an approach. First of all, the max_messages property of a pull request does not guarantee that all available messages up to max_messages will be returned. It is possible to get zero messages back in a pull response and still have undelivered messages. Therefore, you'd have to keep a count of messages received and try to match the num_undelivered_messages metric. You'd have to account for duplicate delivery in this scenario and for the fact that the Stackdriver monitoring metrics can lag behind the actual values. If the value is too large, you may keep pulling for messages you won't receive. If the value is too small, you may not get all of the messages.
Of the two approaches, the one that tracks how long since the last message has been received is the better one, but with the caveats mentioned.
I did this same implementation in Go some months ago. My assumptions were the following:
If there are messages in the queue, the app consumes them very quickly (less than 100 ms between 2 messages).
If the queue is empty (my app has finished consuming all the messages), new messages can still arrive, but more slowly than every 100 ms.
Thereby, I implemented this:
* Each time I receive a message:
* I suspend the 100 ms timeout
* I process and ack the message
* I reset the 100 ms timeout to 0
* If the 100 ms timeout fires, I terminate my pull subscription
In my use case, I schedule my processing every 10 minutes. So I set a global timeout at 9m30s to finish the processing and let the next app instance continue it.
Just one tricky thing: for the 1st message, set the timeout to 2s. Indeed, the first message takes longer to arrive because of connection establishment, so set a flag when you initialize your timeout: "is this the first message or not".
I can share my Go code if it helps with your implementation.
EDIT
Here is my Go code for the message handling:
func (pubSubService *pubSubService) Received() (msgArray []*pubsub.Message, err error) {
    ctx := context.Background()
    cctx, cancel := context.WithCancel(ctx)

    // Connect to PubSub
    client, err := pubsub.NewClient(cctx, pubSubService.projectId)
    if err != nil {
        log.Fatalf("Impossible to connect to pubsub client for project %s", pubSubService.projectId)
    }

    // Put all the messages in an array. It will be processed at the end (stored to BQ, as is)
    msgArray = []*pubsub.Message{}

    // Channel to receive messages
    var receivedMessage = make(chan *pubsub.Message)

    // Handler to receive messages (through the channel) or cancel the context if the timeout is reached
    go func() {
        // Initial timeout, because the first receive takes longer than the others.
        timeOut := time.Duration(3000)
        for {
            select {
            case msg := <-receivedMessage:
                // After the first receive, the timeout is changed
                timeOut = pubSubService.waitTimeOutInMillis // Environment variable = 200
                msgArray = append(msgArray, msg)
            case <-time.After(timeOut * time.Millisecond):
                log.Debug("Cancel by timeout")
                cancel()
                return
            }
        }
    }()

    // Global timeout
    go func() {
        globalTimeOut := pubSubService.globalWaitTimeOutInMillis // Environment variable = 750
        time.Sleep(globalTimeOut * time.Second)
        log.Debug("Cancel by global timeout")
        cancel()
    }()

    // Connect to the subscription and pull it until the context is canceled
    sub := client.Subscription(pubSubService.subscriptionName)
    err = sub.Receive(cctx, func(ctx context.Context, msg *pubsub.Message) {
        receivedMessage <- msg
        msg.Ack()
    })
    return msgArray, err
}

How to subscribe AWS Lambda to Salesforce Platform Events

We want to integrate Salesforce into our microservice structure in AWS.
There is an article about this here.
So we want to subscribe a Lambda to certain platform events in Salesforce.
But I found no code examples for this. I gave it a try using Node.js (without Lambda). This works great:
var jsforce = require('jsforce');
var username = 'xxxxxxxx';
var password = 'xxxxxxxxxxx';
var conn = new jsforce.Connection({loginUrl : 'https://test.salesforce.com'});
conn.login(username, password, function(err, userInfo) {
    if (err) { return console.error(err); }
    console.error('Connected ' + userInfo);
    conn.streaming.topic("/event/Contact_Change__e").subscribe(function(message) {
        console.dir(message);
    });
});
But I am not sure if this is the right way to do it in Lambda.
My understanding of Salesforce Platform Events is that they use CometD under the hood. CometD allows the HTTP client (your code) to subscribe to events published by the HTTP server.
This means your client code needs to be running and in a state where it is subscribed and listening for server events for the duration of time that you expect to be receiving events. In most cases, this duration is indefinite, i.e. your client code expects to wait forever in a subscribed state, ready to receive events.
This is at odds with AWS Lambda functions, which are expected to complete execution in a relatively short amount of time (max 15 minutes last time I checked).
I would suggest you need a long-running process, such as a Node.js application running in Elastic Beanstalk, or in a container. The Node.js application can stay running indefinitely, in a subscribed state. Each time it receives an event, it could call your AWS Lambda function in order to implement the required actions.

Invoke an AWS Step Functions workflow via API Gateway and wait for the execution results

Is it possible to invoke an AWS Step Functions workflow from an API Gateway endpoint and listen for the response (until the workflow completes and returns the results from the end step)?
From the documentation I was able to find that Step Functions are asynchronous by nature and have a final callback at the end. I need the API invocation response to return the end results of the Step Functions flow without polling.
I guess that's not possible.
It's async, and there's also the API Gateway timeout.
You don't need to get the results by polling; you can combine Lambda, Step Functions, SNS and WebSockets to get your results in real time.
If you want to push a notification to a client (web browser) and you don't want to manage your own infrastructure (scaling socket servers, etc.), you could use AWS IoT. This tutorial may help you get started:
http://gettechtalent.com/blog/tutorial-real-time-frontend-updates-with-react-serverless-and-websockets-on-aws-iot.html
If you only need to send the result to a backend (a web service endpoint for example), SNS should be fine.
This will probably work: create an HTTP "gateway" server that dispatches requests to your Step Functions workflow, then holds onto the request object until it receives a notification that allows it to send a response.
The gateway server will need to add a correlation ID to the payload, and the step workflow will need to carry that through.
One plausible way to receive the notification is with SQS.
Some pseudocode that's vaguely Node/Express flavoured:
const cache = new Cache(); // pick your favourite cache library
const gatewayId = guid(); // this lets us scale horizontally

const subscription = subscribeToQueue({
    filter: { gatewayId },
    topic: topicName,
});

httpServer.post( (req, res) => {
    const correlationId = guid();
    cache.add(correlationId, res);
    submitToStepWorkflow(gatewayId, correlationId, req);
});

subscription.onNewMessage( message => {
    // Pop the cached response object that matches this notification
    const res = cache.pop(message.attributes.correlationId);
    res.send(extractResponse(message));
    res.end();
});
(The hypothetical queue reading API here is completely unlike aws-sdk's SQS API, but you get the idea)
So at the end of your step workflow, you just need to publish a message to SQS (perhaps via SNS) ensuring that the correlationId and gatewayId are preserved.
To handle failure, and avoid the cache filling with orphaned request objects, you'd probably want to set an expiry time on the cache, and handle expiry events:
cache.onExpiry( (key, res) => {
    res.status(502);
    res.send(gatewayTimeoutMessage());
    res.end();
});
This whole approach only makes sense for workflows that you expect to normally complete within the sort of times that fit browser and proxy timeouts, of course.