Prefetch Count when using RabbitMQ clustering - concurrency

I'm using RabbitMQ (version 3.8.14) and MassTransit (version 8.0.1).
I need to guarantee that each message is acked before the next one starts being consumed. (In my queue I generate a unique incremental key, so I must ensure that only one message is being consumed at any moment.)
Note that my RabbitMQ is configured as a cluster with three nodes.
To achieve this, I set the prefetch count to 1.
However, in a load test the second message was consumed before the first one was acked, which contradicts the documented behaviour of the prefetch count.
Sample code:
public class Caller
{
    // constructor injection omitted
    private static int number = 0;

    public async Task publish()
    {
        await _publishEndpoint.Publish(new GeneratedChipsEvent { messageNumber = ++number });
    }
}

public class GeneratedChipsEvent
{
    public int messageNumber { get; init; }
}

public class GeneratedChipsEventConsumer : IConsumer<GeneratedChipsEvent>
{
    // constructor injection omitted
    public async Task Consume(ConsumeContext<GeneratedChipsEvent> context)
    {
        _logger.LogInformation($"starting consuming the messsage:{context.Message.messageNumber}");
        // Simulate 3 seconds of work before the message is acked
        await Task.Delay(System.TimeSpan.FromSeconds(3));
        _logger.LogInformation($"ending consuming the messsage:{context.Message.messageNumber}");
    }
}
Logs on my local machine (non-clustered), where the consumer waits for the first message to be acked before it starts consuming the second:
[10:06:08 INF] starting consuming the messsage:1
[10:06:11 INF] ending consuming the messsage:1
[10:06:11 INF] starting consuming the messsage:2
[10:06:14 INF] ending consuming the messsage:2
[10:06:14 INF] starting consuming the messsage:3
[10:06:17 INF] ending consuming the messsage:3
Logs on the server (clustered), where the consumer does not wait for the first message to be acked:
[10:16:33 INF] starting consuming the messsage:1
[10:16:33 INF] starting consuming the messsage:2
[10:16:33 INF] starting consuming the messsage:3
[10:16:36 INF] ending consuming the messsage:1
[10:16:36 INF] ending consuming the messsage:2
[10:16:36 INF] ending consuming the messsage:3

If you are running three instances of your service, you'll see three messages consumed at a time, one on each instance. The prefetch count applies per instance; it is not a global lock.
To have only one consumer active at a time, set this on the receive endpoint configurator:
e.SingleActiveConsumer = true;
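To make the "per instance" point concrete, here is a toy simulation in Python. It is not MassTransit or RabbitMQ code, only a sketch under the assumption that each service instance behaves like a worker that holds at most one unacked message (prefetch count 1) and that all instances read from the same queue. The function name `simulate` and its parameters are made up for this illustration.

```python
import asyncio


async def simulate(instances: int, messages: int, work_seconds: float = 0.01) -> int:
    """Return the peak number of messages being processed at the same time."""
    queue: asyncio.Queue = asyncio.Queue()
    for n in range(1, messages + 1):
        queue.put_nowait(n)

    in_flight = 0
    peak = 0

    async def instance() -> None:
        nonlocal in_flight, peak
        # Prefetch count 1: this instance holds at most one unacked message.
        while True:
            try:
                queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(work_seconds)  # stands in for the consumer's work
            in_flight -= 1  # "ack": only now does this instance take another message

    await asyncio.gather(*(instance() for _ in range(instances)))
    return peak


if __name__ == "__main__":
    print(asyncio.run(simulate(instances=1, messages=3)))  # prints 1
    print(asyncio.run(simulate(instances=3, messages=3)))  # prints 3
```

With one instance the peak concurrency is 1, which matches the local logs in the question. With three instances it is 3, which matches the server logs, even though every instance respects its own prefetch limit.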


AWS Elasticsearch performance issue

I have a search-heavy index; traffic varies between 15k and 20k requests per minute. For the first few days the search response time is around 15 ms, but it then increases gradually and reaches ~70 ms. Some requests start queuing (according to the search thread pool graph in the AWS console), but there are no rejections. Queuing increases the latency of search requests.
I learned that queuing happens when there is pressure on resources. I think I have sufficient CPU and memory; please see the config below.
I enabled slow query logs but didn't find any anomaly. Although the average response time is around 16 ms, I see a few queries going above 50 ms, and I found nothing wrong with the search query itself. The index holds around 8k searchable documents.
I need suggestions on how to improve performance here. The document mapping, search query and ES config are given below. Is there any issue in the mapping or the query?
Mapping:
{
  "data": {
    "mappings": {
      "_doc": {
        "properties": {
          "a": {
            "type": "keyword"
          },
          "b": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
Search query:
{
  "size": 5000,
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "a": ["all", "abc"],
            "boost": 1
          }
        },
        {
          "terms": {
            "b": ["all", 123],
            "boost": 1
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "stored_fields": []
}
I'm using keyword in the mapping and terms in the search query because I want to match exact values. boost and adjust_pure_negative are added automatically; from what I have read, they should not affect performance.
Index settings:
{
  "data": {
    "settings": {
      "index": {
        "number_of_shards": "1",
        "provided_name": "data",
        "creation_date": "12345678154072",
        "number_of_replicas": "7",
        "uuid": "3asd233Q9KkE-2ndu344",
        "version": {
          "created": "10499"
        }
      }
    }
  }
}
ES config:
Master node instance type: m5.large.search
Master nodes: 3
Data node instance type: m5.2xlarge.search
Data nodes: 8 (8 vcpu, 32 GB memory)

DeadlineExceededException when creating tasks on startup

I have a Spring Boot 2.4.5 application deployed on Google Cloud Run (image created with Jib). On startup I want to create a Cloud Task but I get a DeadlineExceededException.
If I run the same task creation code triggered by an HTTP request, the task is created, and the task that was supposed to be created on startup is also created. It's as if something is missing at startup that prevents the task from being created.
The startup event
@EventListener(ApplicationReadyEvent.class)
public void doSomethingAfterStartup() throws IOException {
    LOGGER.info("ApplicationReadyEvent");
    String message = "GCP New Instance Start " + Instant.now();
    cloudTasksService.createTask("xxxx", "us-central1", "xxxx", message, 60);
}
The task creation code
public void createTask(String projectId, String locationId, String queueId, String message, Integer delay) throws IOException {
    try (CloudTasksClient client = CloudTasksClient.create()) {
        LOGGER.info("Client created");
        String url = "xxxxxxxxx";
        String payload = String.format("{ \"text\": \"%s\"}", message);
        String queuePath = QueueName.of(projectId, locationId, queueId).toString();
        Instant eta = Instant.now().plusSeconds(delay);
        Task.Builder taskBuilder =
            Task.newBuilder()
                .setScheduleTime(Timestamp.newBuilder().setSeconds(eta.getEpochSecond()).build())
                .setHttpRequest(
                    HttpRequest.newBuilder()
                        .setBody(ByteString.copyFrom(payload, Charset.defaultCharset()))
                        .setUrl(url)
                        .setHttpMethod(HttpMethod.POST)
                        .build());
        LOGGER.info("TaskBuilder ready");
        Task task = client.createTask(queuePath, taskBuilder.build());
        LOGGER.info("Task created: {}", task.getName());
    }
}
The HTTP endpoint
@GetMapping("/tasks")
public ResponseEntity<Void> task(@RequestParam Integer delay) throws IOException {
    cloudTasksService.createTask("xxxx", "us-central1", "xxxx", "using HTTP request", delay);
    return ResponseEntity.accepted().build();
}
The exception
com.google.api.gax.rpc.DeadlineExceededException: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: Deadline exceeded after 5.200272920s.
at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:51)
at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72)
at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60)
at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68)
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1074)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1213)
at com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:724)
at com.google.common.util.concurrent.ForwardingListenableFuture.addListener(ForwardingListenableFuture.java:45)
at com.google.api.core.ApiFutureToListenableFuture.addListener(ApiFutureToListenableFuture.java:52)
at com.google.common.util.concurrent.Futures.addCallback(Futures.java:1047)
at com.google.api.core.ApiFutures.addCallback(ApiFutures.java:63)
at com.google.api.gax.grpc.GrpcExceptionCallable.futureCall(GrpcExceptionCallable.java:67)
at com.google.api.gax.rpc.UnaryCallable$1.futureCall(UnaryCallable.java:126)
at com.google.api.gax.tracing.TracedUnaryCallable.futureCall(TracedUnaryCallable.java:75)
at com.google.api.gax.rpc.UnaryCallable$1.futureCall(UnaryCallable.java:126)
at com.google.api.gax.rpc.UnaryCallable.futureCall(UnaryCallable.java:87)
at com.google.api.gax.rpc.UnaryCallable.call(UnaryCallable.java:112)
at com.google.cloud.tasks.v2.CloudTasksClient.createTask(CloudTasksClient.java:1915)
at com.google.cloud.tasks.v2.CloudTasksClient.createTask(CloudTasksClient.java:1885)
at com.sps.playground.CloudTasksService.createTask(CloudTasksService.java:55)
It looks like the workers are not ready when the task is being queued. I'd recommend not creating tasks on startup, as this can happen often: the workers have not yet passed the readiness check when the task is processed, and it then fails. That would also explain why the task is created normally when triggered by HTTP.
You can tackle this by reducing your startup time, following the recommendations. Also, as you are using Java with Spring Boot, it may be worth checking the "Reducing startup tasks" recommendations as well.

GCP Cloud Tasks: shorten period for creating a previously created named task

We are developing a GCP Cloud Task based queue process that sends a status email whenever a particular Firestore doc write-trigger fires. The reason we use Cloud Tasks is so a delay can be created (using scheduledTime property 2-min in the future) before the email is sent, and to control dedup (by using a task-name formatted as: [firestore-collection-name]-[doc-id]) since the 'write' trigger on the Firestore doc can be fired several times as the document is being created and then quickly updated by backend cloud functions.
Once the task's delay period has elapsed, the Cloud Task runs and the email is sent with the updated Firestore document info included, after which the task is deleted from the queue and all is good.
Except:
If the user updates the Firestore doc (say 20 or 30 min later) we want to resend the status email but are unable to create the task using the same task-name. We get the following error:
409 The task cannot be created because a task with this name existed too recently. For more information about task de-duplication see https://cloud.google.com/tasks/docs/reference/rest/v2/projects.locations.queues.tasks/create#body.request_body.FIELDS.task.
This was unexpected, as the queue is empty at this point because the last task completed successfully. The documentation referenced in the error message says:
If the task's queue was created using Cloud Tasks, then another task
with the same name can't be created for ~1hour after the original task
was deleted or executed.
Question: is there some way to bypass this restriction, either by lowering the amount of time or by removing the restriction altogether?
The short answer is no. As you've already pointed out, the docs are very clear about this behavior: you have to wait about 1 hour before creating a task with the same name as one that was previously created. Neither the API nor the client libraries allow this time to be decreased.
Having said that, I would suggest using different task IDs instead of reusing the same one, and adding an identifier in the body of the request. For example, using Python:
from google.cloud import tasks_v2
from google.protobuf import timestamp_pb2
import datetime

def create_task(project, queue, location, payload=None, in_seconds=None):
    client = tasks_v2.CloudTasksClient()
    parent = client.queue_path(project, location, queue)
    task = {
        'app_engine_http_request': {
            'http_method': 'POST',
            'relative_uri': '/task/' + queue
        }
    }
    if payload is not None:
        # The request body must be bytes
        converted_payload = payload.encode()
        task['app_engine_http_request']['body'] = converted_payload
    if in_seconds is not None:
        # Schedule the task in_seconds from now
        d = datetime.datetime.utcnow() + datetime.timedelta(seconds=in_seconds)
        timestamp = timestamp_pb2.Timestamp()
        timestamp.FromDatetime(d)
        task['schedule_time'] = timestamp
    # Pass parent and task by keyword (google-cloud-tasks 2.x does not accept them positionally)
    response = client.create_task(parent=parent, task=task)
    print('Created task {}'.format(response.name))
    print(response)

# You can change DOCUMENT_ID to USER_ID or something else to identify the task
create_task(PROJECT_ID, QUEUE, REGION, DOCUMENT_ID)
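If name-based dedup is still wanted for the short burst of writes, one possible variation (an assumption of this sketch, not something the answer above prescribes) is to keep the [firestore-collection-name]-[doc-id] prefix and append a time-window index. Writes inside the same window then map to the same task name and are deduplicated, while a write 20 or 30 minutes later gets a new name. The helper name `windowed_task_id` and the 120-second default are hypothetical; the character filter reflects the Cloud Tasks rule that a task ID may contain only letters, digits, hyphens and underscores, up to 500 characters.

```python
import re


def windowed_task_id(collection: str, doc_id: str, epoch_seconds: float,
                     window_seconds: int = 120) -> str:
    """Build a task ID that is stable within one time window and changes afterwards."""
    # All timestamps inside the same window share one bucket index.
    bucket = int(epoch_seconds // window_seconds)
    raw = f"{collection}-{doc_id}-{bucket}"
    # Replace characters that are not allowed in a task ID and cap the length.
    return re.sub(r"[^A-Za-z0-9_-]", "_", raw)[:500]


if __name__ == "__main__":
    import time

    print(windowed_task_id("orders", "abc", time.time()))
```

Two writes that straddle a window boundary would still produce two tasks, so this only narrows the duplicate window rather than closing it.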
Facing a similar need to debounce multiple invocations of Firestore write-trigger functions, we worked around the default Cloud Tasks task-name-based dedup mechanism (still a constraint in Nov 2022) by building a small debounce "helper" using Firestore transactions.
We're using a helper collection _syncHelper_ to implement a delayed throttle for the side effects of write-trigger fires; in the OP's case, that means sending 1 email for all writes within 2 minutes.
In our case we are using the Firebase Functions task queue utilities rather than interacting with Cloud Tasks directly, but that's immaterial to the solution. The key is to determine the task's execution time in advance and use that as the "dedup key":
async function enqueueTask(shopId) {
  const queueName = 'doSomething';
  const now = new Date();
  const next = new Date(now.getTime() + 2 * 60 * 1000);
  try {
    const shouldEnqueue = await getFirestore().runTransaction(async t => {
      const syncRef = getFirestore().collection('_syncHelper_').doc(<collection_id-doc_id>);
      const doc = await t.get(syncRef);
      let data = doc.data();
      if (data?.timestamp.toDate() > now) {
        return false;
      }
      await t.set(syncRef, { timestamp: Timestamp.fromDate(next) });
      return true;
    });
    if (shouldEnqueue) {
      let queue = getFunctions().taskQueue(queueName);
      await queue.enqueue(
        { timestamp: next.toISOString() },
        { scheduleTime: next }
      );
    }
  } catch {
    ...
  }
}
This will ensure a new task is enqueued only if the "next execution" time has passed.
The execution operation (also a cloud function in our case) will remove the sync data entry if it hasn't been changed since it was executed:
exports.doSomething = functions.tasks.taskQueue({
  retryConfig: {
    maxAttempts: 2,
    minBackoffSeconds: 60,
  },
  rateLimits: {
    maxConcurrentDispatches: 2,
  }
}).onDispatch(async data => {
  let { timestamp } = data;
  await sendYourEmailHere();
  await getFirestore().runTransaction(async t => {
    const syncRef = getFirestore().collection('_syncHelper_').doc(<collection_id-doc_id>);
    const doc = await t.get(syncRef);
    let data = doc.data();
    if (data?.timestamp.toDate() <= new Date(timestamp)) {
      await t.delete(syncRef);
    }
  });
});
This isn't a bulletproof solution (for example, if the doSomething() execution function has high latency), but it is good enough for 99% of our use cases.

Get tasks status in AWS Step Functions (boto3)

I am currently using boto3 (the Amazon Web Services (AWS) SDK for Python) to create state machines, start executions and also in my workers to retrieve tasks and report their status (completed successfully or failed).
I have another service that needs to know the tasks' status and I would like to do so by retrieving it from AWS. I searched the available methods and it is only possible to get the status of a state machine/execution as a whole (RUNNING|SUCCEEDED|FAILED|TIMED_OUT|ABORTED).
There is also the get_execution_history method, but each step is identified by a sequentially numbered id and there is no information about the task itself (only the "stateEnteredEventDetails" event contains the name of the task, but subsequent events may not be related to it, so it is impossible to know whether the task was successful).
Is it really not possible to retrieve the status of a specific task, or am I missing something?
Thank you!
I had the same problem. It seems that Step Functions does not treat states and tasks as entities, so there is no API to get information about them.
In order to get info about the task's status you need to parse the information in the execution history. In my case I first check the execution status:
import boto3
import json

client = boto3.client("stepfunctions")
response = client.describe_execution(
    executionArn=EXECUTION_ARN
)
status = response["status"]
and if it is "FAILED" then I analyze the history and get the most relevant fields for my use case (for events of type "TaskFailed"):
response = client.get_execution_history(
    executionArn=EXECUTION_ARN,
    maxResults=1000
)
events = response["events"]
# Follow pagination until the whole history has been fetched
while response.get("nextToken"):
    response = client.get_execution_history(
        executionArn=EXECUTION_ARN,
        maxResults=1000,
        nextToken=response["nextToken"]
    )
    events += response["events"]
# The "cause" of a TaskFailed event is a JSON string
causes = [
    json.loads(e["taskFailedEventDetails"]["cause"])
    for e in events
    if e["type"] == "TaskFailed"
]
# Keep only the most relevant fields of each failure cause
failed_tasks = [
    {
        "ClusterArn": cause["ClusterArn"],
        "Containers": [
            {
                "ContainerArn": container["ContainerArn"],
                "Name": container["Name"],
                "ExitCode": container["ExitCode"],
                "Overrides": cause["Overrides"]["ContainerOverrides"][i]
            }
            for i, container in enumerate(cause["Containers"])
        ],
        "TaskArn": cause["TaskArn"],
        "StoppedReason": cause["StoppedReason"]
    }
    for cause in causes
]
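To attribute an outcome to a named task, one option is to follow each event's previousEventId back to the TaskStateEntered event that carries the state name. The sketch below works on the list of event dicts returned by get_execution_history. It assumes that the previousEventId chain of a TaskSucceeded, TaskFailed or TaskTimedOut event leads back to the entry event of the same state; the helper name `task_statuses` and the status labels are made up for this example.

```python
def task_statuses(events):
    """Map each task state name to the outcome of its most recent attempt."""
    by_id = {e["id"]: e for e in events}
    outcome = {
        "TaskSucceeded": "SUCCEEDED",
        "TaskFailed": "FAILED",
        "TaskTimedOut": "TIMED_OUT",
    }
    statuses = {}
    for event in events:
        if event["type"] not in outcome:
            continue
        # Walk backwards until the event that entered the task state is found.
        current = event
        while current is not None and current["type"] != "TaskStateEntered":
            current = by_id.get(current.get("previousEventId"))
        if current is not None:
            name = current["stateEnteredEventDetails"]["name"]
            statuses[name] = outcome[event["type"]]
    return statuses


if __name__ == "__main__":
    sample = [
        {"id": 1, "previousEventId": 0, "type": "ExecutionStarted"},
        {"id": 2, "previousEventId": 0, "type": "TaskStateEntered",
         "stateEnteredEventDetails": {"name": "Resize"}},
        {"id": 3, "previousEventId": 2, "type": "TaskScheduled"},
        {"id": 4, "previousEventId": 3, "type": "TaskStarted"},
        {"id": 5, "previousEventId": 4, "type": "TaskFailed"},
    ]
    print(task_statuses(sample))  # prints {'Resize': 'FAILED'}
```

If a state is retried, later outcomes overwrite earlier ones, so the result reflects the last attempt recorded in the history.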

Not able to solve throttlingException in DynamoDB

I have a lambda function which does a transaction in DynamoDB similar to this.
try {
  const reservationId = genId();
  await transactionFn();
  return {
    statusCode: 200,
    body: JSON.stringify({id: reservationId})
  };

  async function transactionFn() {
    try {
      await docClient.transactWrite({
        TransactItems: [
          {
            Put: {
              TableName: ReservationTable,
              Item: {
                reservationId,
                userId,
                retryCount: Number(retryCount),
              }
            }
          },
          {
            Update: {
              TableName: EventDetailsTable,
              Key: {eventId},
              ConditionExpression: 'available >= :minValue',
              UpdateExpression: `set available = available - :val, attendees= attendees + :val, lastUpdatedDate = :updatedAt`,
              ExpressionAttributeValues: {
                ":val": 1,
                ":updatedAt": currentTime,
                ":minValue": 1
              }
            }
          }
        ]
      }).promise();
      return true
    } catch (e) {
      const transactionConflictError = e.message.search("TransactionConflict") !== -1;
      // const throttlingException = e.code === 'ThrottlingException';
      console.log("transactionFn:transactionConflictError:", transactionConflictError);
      if (transactionConflictError) {
        retryCount += 1;
        await transactionFn();
        return;
      }
      // if(throttlingException){
      //
      // }
      console.log("transactionFn:e.code:", JSON.stringify(e));
      throw e
    }
  }
It just updates 2 tables on each API call. If it encounters a transaction conflict error, it simply retries the transaction by calling the function recursively.
The eventDetails table is receiving too many updates (I checked with AWS Contributor Insights), so I raised its provisioned capacity to a higher value than before.
For the reservation table, the capacity mode is on-demand.
When I load test this API with 400 (or more) users using JMeter (master-slave configuration), I get throttling errors for some API calls, and some calls take more than 20 seconds to respond.
When I checked X-Ray for this API, I found that DynamoDB takes too much time on this transaction for the slower calls.
Even with a much higher fixed provisioning (I tried on-demand scaling too), I still get throttling exceptions for API calls.
ProvisionedThroughputExceededException: The level of configured provisioned throughput for the table was exceeded.
Consider increasing your provisioning level with the UpdateTable API.
UPDATE
One more thing: when I do the load testing, I always use the same eventId, which means I am always updating the same row for all the API requests. I found this article, which says that a single partition can only have up to 1000 WCU. Since I am always updating the same row in the eventDetails table during load testing, is that causing this issue?
I had this exact error, and it helped me to change the Read/write capacity mode from On Demand to Provisioned. Try changing that; if that doesn't help, we'll go from there.
From the link you cite in your update, also described in an AWS help article here, it sounds like the issue is that all of your load testers are writing to the same entry in the table, which is going to be in the same partition and therefore subject to the hard limit of 1,000 WCU.
Have you tried repeating this experiment with the load testers writing to different partitions?
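A rough back-of-the-envelope check of how quickly a single item can reach that limit is sketched below. It relies on the DynamoDB pricing rule that a transactional write costs 2 WCU per 1 KB of item size (rounded up), and it treats the request rate and item size as assumptions, since the question gives neither. The function name `transactional_wcu_per_second` is hypothetical.

```python
import math

# Per-partition write limit cited in the question and the answer above.
PARTITION_WCU_LIMIT = 1000


def transactional_wcu_per_second(writes_per_second: int, item_size_bytes: int = 1024) -> int:
    """Estimate WCU per second consumed on one item by transactional writes."""
    # Item size is rounded up to the next whole KB; transactional writes cost double.
    size_kb = max(1, math.ceil(item_size_bytes / 1024))
    return writes_per_second * size_kb * 2


if __name__ == "__main__":
    for rate in (400, 600):
        used = transactional_wcu_per_second(rate)
        print(rate, used, used > PARTITION_WCU_LIMIT)
```

Under these assumptions, 400 transactional updates per second to the same eventId already consume 800 WCU on one partition, and every retry after a TransactionConflict is a further write attempt against that same partition.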