Azure Queue Storage-Triggered WebJob - Dequeue Multiple Messages at a Time

I have an Azure Queue Storage-Triggered Webjob. The process my webjob performs is to index the data into Azure Search. Best practice for Azure Search is to index multiple items together instead of one at a time, for performance reasons (indexing can take some time to complete).
For this reason, I would like for my webjob to dequeue multiple messages together so I can loop through, process them, and then index them all together into Azure Search.
However, I can't figure out how to get my WebJob to dequeue more than one message at a time. How can this be accomplished?

Based on your description, I suggest you try the Microsoft.Azure.WebJobs.Extensions.GroupQueueTrigger package to achieve your requirement.
This extension enables your triggered functions to receive a group of messages instead of a single message, as with [QueueTrigger].
For more details, refer to the code sample below.
Install:
Install-Package Microsoft.Azure.WebJobs.Extensions.GroupQueueTrigger
Program.cs:
static void Main()
{
    var config = new JobHostConfiguration
    {
        StorageConnectionString = "...",
        DashboardConnectionString = "...."
    };
    config.UseGroupQueueTriggers();
    var host = new JobHost(config);
    host.RunAndBlock();
}
Function.cs:
//Receive 10 messages at one time
public static void MyFunction([GroupQueueTrigger("queue3", 10)] List<string> messages)
{
    foreach (var item in messages)
    {
        Console.WriteLine(item);
    }
}
How would I get that changed to a GroupQueueTrigger? Is it an easy change?
In my opinion, it is an easy change.
You could follow the steps below:
1. Install the package Microsoft.Azure.WebJobs.Extensions.GroupQueueTrigger from the NuGet Package Manager.
2. Change the Program.cs file to enable the group queue triggers (config.UseGroupQueueTriggers()).
3. Change your WebJob functions based on your old triggered functions.
Note: the group queue trigger must bind to a List.
As my code sample shows:
public static void MyFunction([GroupQueueTrigger("queue3", 10)]List<string> messages)
This function will dequeue 10 messages from "queue3" at a time, so inside it you can loop through the list of messages, process them, and then index them all together into Azure Search (see the sketch after this list).
4. Publish your WebJob to Azure Web Apps.
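For step 3, here is a minimal sketch of what the rewritten function could look like when it indexes the whole group of messages into Azure Search in one batch, using the older Microsoft.Azure.Search SDK. The MyDocument type, the service name, index name, admin key, and the assumption that each queue message carries one JSON document are placeholders, not part of the original answer:
using System.Collections.Generic;
using Microsoft.Azure.Search;
using Microsoft.Azure.Search.Models;
using Newtonsoft.Json;

// Hypothetical document type matching the target index schema.
public class MyDocument
{
    [JsonProperty("id")]
    public string Id { get; set; }

    [JsonProperty("content")]
    public string Content { get; set; }
}

public static class Functions
{
    // Placeholder service name, index name and admin key.
    private static readonly ISearchIndexClient IndexClient =
        new SearchIndexClient("your-search-service", "your-index", new SearchCredentials("your-admin-key"));

    // Receives up to 10 queue messages at once and indexes them as a single batch.
    public static void MyFunction([GroupQueueTrigger("queue3", 10)] List<string> messages)
    {
        var documents = new List<MyDocument>();
        foreach (var message in messages)
        {
            // Assumes each queue message is the JSON payload of one search document.
            documents.Add(JsonConvert.DeserializeObject<MyDocument>(message));
        }

        // One round trip to Azure Search for the whole group of messages.
        var batch = IndexBatch.MergeOrUpload(documents);
        IndexClient.Documents.Index(batch);
    }
}
If individual documents can fail, you can catch IndexBatchException and retry only the failed actions instead of resubmitting the whole batch.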

Related

What is causing unreachable errors in Google PubSub?

I'm running an application that consists of Google Cloud Functions triggered by PubSub topics, so basically they communicate with each other via Google PubSub.
The problem is that it sometimes struggles and shows delays of up to 9 s or more when publishing messages. I checked the Metrics Explorer and found that during the high delays it shows the following errors:
unreachable_5xx_error_500
unreachable_no_response
internal_rejected_error
unreachable_5xx_error_503
url_4xx_error_429
Code example of publishing a message inside a Google Cloud Function:
const {PubSub} = require('@google-cloud/pubsub');

const pubSubClient = new PubSub();

async function publishMessage() {
  const topicName = 'my-topic';
  // Example payload; in the real function the data comes from its own logic.
  const data = JSON.stringify({foo: 'bar'});
  const dataBuffer = Buffer.from(data);
  const messageId = await pubSubClient.topic(topicName).publish(dataBuffer);
  console.log(`Message ${messageId} published.`);
}

publishMessage().catch(console.error);
Code example of a function triggered by a PubSub topic:
exports.subscribe = async (message) => {
  const name = message.data
    ? Buffer.from(message.data, 'base64').toString()
    : 'World';
  console.log(`Hello, ${name}!`);
};
I think these errors are causing the delays. I didn't find anything about this on the internet, so I hope you can explain what is causing these errors and why, and perhaps help with this.
As discussed in the comments, there are some changes and workarounds that can solve or reduce the problem.
First, as described in this guide, PubSub tries to gather multiple messages before delivering them; in other words, it tries to deliver many messages at once. In this specific case, to get closer to a real-time scenario, a batch size of 1 should be specified, which causes PubSub to deliver every message separately. The batch size is specified using the maxMessages property when creating the publisher object, as in the code below. Besides that, the maxMilliseconds property can be used to specify the maximum latency allowed.
// Batch size of 1 so every message is delivered separately, as described above.
const maxMessages = 1;
// Maximum publish latency in seconds (example value).
const maxWaitTime = 0.01;

const batchPublisher = pubSubClient.topic(topicName, {
  batching: {
    maxMessages: maxMessages,
    maxMilliseconds: maxWaitTime * 1000,
  },
});
In the discussion it was also noted that the problem is probably related to Cloud Functions cold starts, which increase the latency for this application due to its architecture. The workaround for this part of the problem was to insert a Node.js server into the architecture to trigger the functions using PubSub.

How to retrieve the current state of a running Step Functions execution in AWS

I'm giving AWS Step Functions a try, and I'm interested in them for implementing long-running procedures. One piece of functionality I would like to provide to my users is the ability to show an execution's progress. Using describeExecution I can verify whether an execution is still running or done. But progress is a logical measure, and Step Functions itself has no way to tell me how much of the process is left.
For that, I need to provide the logic myself. I can measure the progress in the tasks of the state machine, knowing the total number of steps to take and counting the number of steps already taken. I can store this information in the state of the machine, which is passed among steps while the machine is running. But how can I extract this state using the API? Of course, I could store this information in external storage like DynamoDB, but that's not very elegant!
The solution I have found myself (so far the only one) is to use the getExecutionHistory API. This API returns a list of events generated for the Step Functions execution, and it can include input or output (or neither) depending on whether the event is for starting a Lambda function or for when a Lambda function exited. You can call the API like this:
// Assumes the AWS SDK for JavaScript (v2) is installed and configured.
var AWS = require('aws-sdk');
var stepfunctions = new AWS.StepFunctions();

var params = {
  executionArn: 'STRING_VALUE', /* required */
  maxResults: 10,
  reverseOrder: true
};
stepfunctions.getExecutionHistory(params, function(err, data) {
  if (err) console.log(err, err.stack); // an error occurred
  else console.log(data);               // successful response
});
By reversing the order of the list of events, we get the latest ones first. We can then look for the latest output in the list; the first one we find will be the latest version of the output, which is the current state of the Step Functions execution.

How to lock a long async call in a WebApi action?

I have a scenario where I have a Web API with an endpoint that, when triggered, does a lot of work (around 2-5 minutes). It is a POST endpoint with side effects, and I would like to limit the execution so that if 2 requests are sent to this endpoint (this should not happen, but better safe than sorry), one of them will have to wait in order to avoid race conditions.
I first tried to use a simple static lock inside the controller like this:
lock (_lockObj)
{
    var results = await _service.LongRunningWithSideEffects();
    return Ok(results);
}
This is, of course, not possible because of the await inside the lock statement.
Another solution I considered was to use a SemaphoreSlim implementation like this:
// Assumes a semaphore shared across requests, e.g.:
// private static readonly SemaphoreSlim semaphore = new SemaphoreSlim(1, 1);
await semaphore.WaitAsync();
try
{
    var results = await _service.LongRunningWithSideEffects();
    return Ok(results);
}
finally
{
    semaphore.Release();
}
However, according to MSDN:
The SemaphoreSlim class represents a lightweight, fast semaphore that can be used for waiting within a single process when wait times are expected to be very short.
Since in this scenario the wait times may even reach 5 minutes, what should I use for concurrency control?
EDIT (in response to plog17):
I do understand that passing this task onto a service might be the optimal way; however, I do not necessarily want to queue something in the background that still runs after the request is done.
The request involves other requests and integrations that take some time, but I would still like the user to wait for this request to finish and get a response regardless.
This request is expected to be fired only once a day at a specific time by a cron job. However, there is also an option to fire it manually by a developer (mostly in case something goes wrong with the job), and I would like to ensure the API doesn't run into concurrency issues if the developer, for example, accidentally double-sends the request.
If only one request of that sort can be processed at a given time, why not implement a queue?
With such a design, there is no more need to lock or wait while processing the long-running request.
The flow could be:
Client POSTs /ResourcesToProcess and should quickly receive 202 Accepted
HttpController simply queues the task to process (and returns the 202 Accepted)
Another service (a Windows service?) dequeues the next task to process
Processes the task
Updates the resource status
During this process, the client should easily be able to get the status of previously made requests:
If the task is not found: 404 Not Found. Resource not found for id 123
If the task is processing: 200 OK. 123 is processing.
If the task is done: 200 OK. Process response.
Your controller could look like:
public class TaskController : ApiController
{
    //constructor and private members

    [HttpPost, Route("")]
    public IHttpActionResult QueueTask(RequestBody body)
    {
        // The work itself happens elsewhere; just acknowledge the request.
        messageQueue.Add(body);
        return StatusCode(HttpStatusCode.Accepted);
    }

    [HttpGet, Route("{taskId}")]
    public IHttpActionResult GetTask(string taskId)
    {
        YourThing thing = tasksRepository.Get(taskId);
        if (thing == null)
        {
            return Content(HttpStatusCode.NotFound, "thing does not exist");
        }
        if (thing.IsProcessing)
        {
            return Ok("thing is processing");
        }
        if (thing.ResponseContent == null)
        {
            return Ok("thing is not processing yet");
        }
        //here we assume the thing has been processed
        return Ok(thing.ResponseContent);
    }
}
This design implies that you do not handle the long-running process inside your Web API. Indeed, doing so may not be the best design choice. If you still want to do it, you may want to read:
Long running task in WebAPI
https://blogs.msdn.microsoft.com/webdev/2014/06/04/queuebackgroundworkitem-to-reliably-schedule-and-run-background-processes-in-asp-net/

How to get all goals triggered during Sitecore session in commitDataSet Analytics pipeline?

I have an Analytics pipeline processor added just before the standard one in the commitDataSet section to delete duplicate triggered page events before they are all submitted to the database, so I can have unique triggered events, as there seems to be a bug on Android/iOS devices that triggers several events within a few seconds of each other.
In this custom processor I need to get the list of all goals/events the current user has triggered in his session, so I can compare them with the values in the dataset obtained from the args parameter and delete the ones already triggered.
The args.DataSet.Tables["PageEvents"] only returns the set to be submitted to the database, and that doesn't help since it changes each time this pipeline runs. I also tried Sitecore.Analytics.Tracker.Visitor.DataSet, but I get a null value for these properties.
Does anyone know a way to get a list of all the goals the user has triggered so far in his session without querying the database directly?
Some code:
public class CommitUniqueAnalytics : CommitDataSetProcessor
{
    public override void Process(CommitDataSetArgs args)
    {
        Assert.ArgumentNotNull(args, "args");
        var table = args.DataSet.Tables["PageEvents"];
        if (table != null)
        {
            //Sitecore.Analytics.Tracker.Visitor.DataSet.PageEvents - this list always empty
            ...........
        }
    }
}
I had a similar question.
In Sitecore 7.5 I found that this worked:
Tracker.Current.Session.Interaction.Pages.SelectMany(x=>x.PageEvents)
However, I'm a little worried that this will be inefficient if the Pages collection is very large.
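Building on that answer, here is a rough sketch of how the duplicate removal could look inside the processor from the question. It is only an illustration: the PageEventDefinitionId property/column name and the exact shape of the Sitecore 7.5 API are assumptions, so verify them against your Sitecore version.
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;
// Sitecore namespaces (Tracker, CommitDataSetProcessor, Assert) as in the original processor.

public class CommitUniqueAnalytics : CommitDataSetProcessor
{
    public override void Process(CommitDataSetArgs args)
    {
        Assert.ArgumentNotNull(args, "args");
        var table = args.DataSet.Tables["PageEvents"];
        if (table == null)
        {
            return;
        }

        // Goals/events already registered earlier in the session.
        // Assumes PageEventDefinitionId identifies an event (hypothetical name).
        var alreadyTriggered = new HashSet<Guid>(
            Tracker.Current.Session.Interaction.Pages
                .SelectMany(x => x.PageEvents)
                .Select(e => e.PageEventDefinitionId));

        // Remove rows from the outgoing batch whose event was already triggered.
        foreach (DataRow row in table.Rows.Cast<DataRow>().ToList())
        {
            if (alreadyTriggered.Contains((Guid)row["PageEventDefinitionId"]))
            {
                row.Delete();
            }
        }
    }
}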

Aggregate parallel results from different web services in Windows Azure

Webservice1 can receive a set of Lon/Lat variables. Based on these variables it returns a resultset of items nearby.
In order to create the result set, Webservice1 has to pass the variables to several web services of our own and several external web services. All of these web services return a result set. The combination of the result sets from these secondary web services is the result set to be returned by Webservice1.
What is the best design approach within Windows Azure with costs and performance in mind?
Should we sequentially fire requests from Webservice1 to the other web services, wait for a response, and continue? Or can we, for example, use a queue where we post the variables to be picked up by the secondary web services?
I think you've answered your own question in the title.
I wouldn't worry about using a queue. Queues are great for sending information off to get dealt with by something else when it doesn't matter how long it takes to process. As you've got a web service that's waiting to return results, this is not ideal.
Sending the requests to each of the other web services one at a time will work and is the easiest option technically, but it won't give you the best performance.
In this situation I would send requests to each of the other web services in parallel using the Task Parallel Library. Presuming the order of the items that you return isn't important, your code might look a bit like this.
public List<LocationResult> PlacesOfInterest(LocationParameters parameters)
{
    WebService[] webServices = GetArrayOfAllWebServices();
    LocationResult[][] results = new LocationResult[webServices.Length][];

    // Call all of the web services in parallel
    Parallel.For(0,
        webServices.Length,
        i =>
        {
            results[i] = webServices[i].PlacesOfInterest(parameters);
        });

    var finalResults = new List<LocationResult>();

    // Put all the results together
    for (int i = 0; i < webServices.Length; i++)
    {
        finalResults.AddRange(results[i]);
    }
    return finalResults;
}
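If the secondary web service clients expose asynchronous methods, a Task.WhenAll-based variant achieves the same fan-out without blocking threads while waiting on I/O. This is only a sketch: the IWebService interface and its PlacesOfInterestAsync method are hypothetical stand-ins for whatever client proxies you actually use, while LocationResult and LocationParameters are the same types as above.
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public class LocationAggregator
{
    // Hypothetical async client interface; the real service proxies may differ.
    public interface IWebService
    {
        Task<LocationResult[]> PlacesOfInterestAsync(LocationParameters parameters);
    }

    private readonly IWebService[] webServices;

    public LocationAggregator(IWebService[] webServices)
    {
        this.webServices = webServices;
    }

    public async Task<List<LocationResult>> PlacesOfInterestAsync(LocationParameters parameters)
    {
        // Start all calls at once, then await them together.
        var tasks = webServices.Select(ws => ws.PlacesOfInterestAsync(parameters));
        LocationResult[][] results = await Task.WhenAll(tasks);

        // Flatten the per-service result sets into one list.
        return results.SelectMany(r => r).ToList();
    }
}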