How do I get a user's task list with its process variables in Camunda?

I have a requirement where a user could be assigned thousands of tasks (1000 - 5000) at a given time, belonging to different process instances (the same user task across 1000 - 5000 instances). I have a custom task list screen where I need to load all the tasks with their basic info (id, name, process instance id, etc.) and some process variables for each.
First I used the filter/list REST service, i.e. engine-rest/filter/{filter-id}/list, to get the tasks with their process variables (I created a filter in Camunda Tasklist). But this REST service takes forever to return when more than 1000 process instances are involved: it took 7-8 minutes for about 2000 process instances, perhaps because the service returns a lot of information I don't need.
So I decided to write my own REST service using the Camunda Java API. This is what I did:
List<Task> tasks = taskService.createTaskQuery()
        .processDefinitionKey(processDefinitionKey)
        .taskAssignee(assignee)
        .list();
if (tasks != null && !tasks.isEmpty()) {
    for (Task task : tasks) {
        .....
        // one variable lookup per task
        Map<String, Object> variables = taskService.getVariables(task.getId(), variableNames);
        .....
    }
}
This works and is much faster than the filter service, but for about 1000 instances it still takes around 25 seconds. (My server is not production grade right now: Tomcat with Xms 1 GB, Xmx 2 GB.)
But my concern is whether this code internally hits the DB 1000 times (once for each task returned by the task query) to get the variables. Worse still, depending on the number of variables, does it query the DB that many times for each variable? I.e., for 5 variables, are we hitting the DB 5000 times?
1) If so, is there any way I can improve this service? For example, can I write a NativeTaskQuery that joins the act_ru_task, act_ru_execution and act_ru_variable tables to get the data I need? Is that the right way?
2) Isn't there any built-in caching in Camunda that could help here?
Thanks in advance for your help.

You can use a custom query for this. Write your native SQL query and add a MyBatis mapping. This example explains the concept: https://github.com/camunda-consulting/code/tree/master/snippets/custom-queries
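If you want to stay on the Java API before reaching for native SQL, one option is to batch the variable lookup into a single VariableInstanceQuery instead of one taskService.getVariables() call per task. A sketch, assuming the variables you need are process-instance-scoped (task-local variables would need taskIdIn instead):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.camunda.bpm.engine.RuntimeService;
import org.camunda.bpm.engine.TaskService;
import org.camunda.bpm.engine.runtime.VariableInstance;
import org.camunda.bpm.engine.task.Task;

public class TaskListService {

    // Returns the requested variables grouped by process instance id.
    public Map<String, Map<String, Object>> loadVariables(
            TaskService taskService, RuntimeService runtimeService,
            String processDefinitionKey, String assignee, List<String> variableNames) {

        List<Task> tasks = taskService.createTaskQuery()
                .processDefinitionKey(processDefinitionKey)
                .taskAssignee(assignee)
                .list();

        String[] processInstanceIds = tasks.stream()
                .map(Task::getProcessInstanceId)
                .distinct()
                .toArray(String[]::new);

        Map<String, Map<String, Object>> byInstance = new HashMap<>();
        if (processInstanceIds.length > 0) {
            // One query for all variables instead of one round trip per task.
            List<VariableInstance> variables = runtimeService.createVariableInstanceQuery()
                    .processInstanceIdIn(processInstanceIds)
                    .variableNameIn(variableNames.toArray(new String[0]))
                    .list();
            for (VariableInstance v : variables) {
                byInstance.computeIfAbsent(v.getProcessInstanceId(), k -> new HashMap<>())
                        .put(v.getName(), v.getValue());
            }
        }
        // Each task's variables: byInstance.get(task.getProcessInstanceId())
        return byInstance;
    }
}

Depending on the database, a very large IN list may need to be chunked. For task data and variables in one SQL statement, the custom MyBatis query from the linked example remains the cleaner approach.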

Related

GCP Datastore times out on large download

I'm using Objectify to access my GCP Datastore set of entities. I have a full list of around 22000 items that I need to load into the frontend:
List<Record> recs = ofy().load().type(Record.class).order("-sync").list();
The number of records has recently increased and I get an error from the backend:
com.google.apphosting.runtime.HardDeadlineExceededError: This request (00000185caff7b0c) started at 2023/01/19 17:06:58.956 UTC and was still executing at 2023/01/19 17:08:02.545 UTC.
I thought that the move to Cloud Firestore in Datastore mode last year would have fixed this problem.
My only solution is to break down the load() into batches using 2 or 3 calls to my Ofy Service.
Is there a better way to grab all these Entities in one go?
Thanks
Tim
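One way to implement the batching Tim mentions is cursor-based paging. A sketch, assuming Objectify 5 on the App Engine standard runtime (the HardDeadlineExceededError suggests standard; in Objectify 6 the Cursor and iterator types differ slightly), with Record and the "-sync" index taken from the question:

import java.util.ArrayList;
import java.util.List;

import com.google.appengine.api.datastore.Cursor;
import com.google.appengine.api.datastore.QueryResultIterator;
import com.googlecode.objectify.cmd.Query;

import static com.googlecode.objectify.ObjectifyService.ofy;

public class RecordLoader {

    public List<Record> loadAll() {
        List<Record> all = new ArrayList<>();
        Cursor cursor = null;
        while (true) {
            Query<Record> query = ofy().load().type(Record.class)
                    .order("-sync")
                    .limit(1000);                  // page size: tune to stay under the deadline
            if (cursor != null) {
                query = query.startAt(cursor);      // resume where the last page ended
            }
            QueryResultIterator<Record> it = query.iterator();
            int fetched = 0;
            while (it.hasNext()) {
                all.add(it.next());
                fetched++;
            }
            if (fetched == 0) {
                break;                              // no more records
            }
            cursor = it.getCursor();
        }
        return all;
    }
}

Within a single request this still does the same total work, so the usual refinement is to send the cursor back to the frontend (cursor.toWebSafeString()) and fetch each page in its own request, which keeps every request well under the deadline.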

How to handle background jobs in "Cloud Run" when new instance is created immediately?

I have a FastAPI project in Cloud Run and it has some background jobs inside it. (Not heavy stuff)
However, when a new instance is created by Cloud Run due to the number of requests etc., every instance runs the background jobs concurrently.
For example:
I have a task that creates invoices for customers in the background, and if three instances are created at once, three invoices will be created.
I researched "FOR UPDATE" usage in PostgreSQL and the like. It seems I could solve this by modifying my database, but I wonder whether it can be solved on the Cloud side.
I don't want to limit the max number of instances to 1.
What would you do in this situation?
Thank you for your time.
If you can potentially have N instances of a job (because you don't want to set the max limit to 1), you need to implement your jobs in an idempotent way. Broadly speaking, you have a few ways to achieve idempotency:
by enforcing a business constraint.
by storing an idempotency key (a sketch follows the Stripe example below).
by using the Etag HTTP response header.
For example, Stripe lets you define an idempotency key for all of your API requests. Stripe stores this key on its servers, and when you make a POST request with the same payload as a previous one, Stripe returns you the same result. POST requests are not idempotent, but with this "trick" they become idempotent.
Stripe's idempotency works by saving the resulting status code and body of the first request made for any given idempotency key, regardless of whether it succeeded or failed. Subsequent requests with the same key return the same result, including 500 errors.
https://stripe.com/docs/api/idempotent_requests
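To make the idempotency-key option concrete, here is a minimal sketch against the PostgreSQL mentioned in the question, using plain JDBC and a hypothetical invoice_jobs table (CREATE TABLE invoice_jobs (idempotency_key text PRIMARY KEY)): the first instance to insert the key does the work, and every other instance skips it.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class IdempotentInvoiceJob {

    // Returns true only for the first instance that claims the key.
    static boolean claim(Connection db, String key) throws SQLException {
        try (PreparedStatement ps = db.prepareStatement(
                "INSERT INTO invoice_jobs (idempotency_key) VALUES (?)")) {
            ps.setString(1, key);
            ps.executeUpdate();
            return true;
        } catch (SQLException e) {
            if ("23505".equals(e.getSQLState())) {  // unique_violation: already claimed
                return false;
            }
            throw e;
        }
    }

    static void run(Connection db, String customerId, String billingPeriod) throws SQLException {
        // One key per customer and period: at most one invoice, however many instances run.
        if (claim(db, "invoice:" + customerId + ":" + billingPeriod)) {
            createInvoice(customerId, billingPeriod);
        }
    }

    static void createInvoice(String customerId, String billingPeriod) {
        // ... actual invoice creation ...
    }
}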
Tip: you could expand your question by clarifying how these background tasks are created, and where they run.

How to process files serially in cloud function?

I have written a Cloud Storage trigger-based Cloud Function. I have 10-15 files landing at 5-second intervals in a Cloud Storage bucket, and each loads data into a BigQuery table (truncate and load).
When there are 10 files in the bucket, I want the Cloud Function to process them sequentially, i.e. one file at a time, since all the files access the same table.
Currently the Cloud Function is triggered for multiple files at a time, and it fails in the BigQuery operation because multiple files try to access the same table.
Is there any way to configure this in Cloud Functions?
Thanks in Advance!
You can achieve this by using Pub/Sub and the max instances parameter on Cloud Functions.
First, use the notification capability of Google Cloud Storage to sink the bucket events into a Pub/Sub topic.
Now you will receive a message every time an event occurs on the bucket. If you want to filter on file creation only (OBJECT_FINALIZE), you can apply a filter on the subscription. I wrote an article on this.
Then create an HTTP function (an HTTP function is required if you want to apply a filter) with max instances set to 1. This way, only one function instance can execute at a time, so there is no concurrency.
Finally, create a Pub/Sub push subscription on the topic, with or without a filter, to call your function over HTTP.
EDIT
Thanks to your code, I understood what happens. BigQuery is a declarative system: when you perform a request or a load job, a job is created and runs in the background.
In Python you can explicitly wait for the end of the job, but with pandas I didn't find how.
I found a Google Cloud page explaining how to migrate from pandas to the BigQuery client library. As you can see, there is a line at the end:
# Wait for the load job to complete.
job.result()
which waits for the end of the job.
You did this correctly in the _insert_into_bigquery_dwh function, but not in the staging one, _insert_into_bigquery_staging. This can lead to two issues:
The dwh function works on old data, because the staging load hasn't finished when you trigger that job.
If the staging load takes, say, 10 seconds and runs in the "background" (you don't wait for its end explicitly in your code) while the dwh load takes 1 second, the next file is processed as soon as the dwh function ends, even though the staging load is still running in the background. And that leads to your issue.
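For comparison, the same synchronous wait in the BigQuery Java client library (a sketch with hypothetical dataset, table and source file names; in the question's Python code, job.result() plays the role of waitFor()):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class LoadAndWait {

    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Hypothetical dataset, table and source file.
        LoadJobConfiguration config = LoadJobConfiguration
                .newBuilder(TableId.of("my_dataset", "staging"), "gs://my-bucket/input.csv")
                .setFormatOptions(FormatOptions.csv())
                .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
                .build();

        Job job = bigquery.create(JobInfo.of(config));
        // Block until the load finishes, so the next step sees the new data.
        job = job.waitFor();
        if (job.getStatus().getError() != null) {
            throw new RuntimeException(job.getStatus().getError().toString());
        }
    }
}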
The architecture you describe isn't the same as the one in the documentation you linked. Note that in the flow diagram and the code samples the storage event triggers the Cloud Function, which streams the data directly into the destination table. Since BigQuery allows multiple concurrent streaming inserts, several functions can execute at the same time without problems. In your use case, the intermediate table loaded with write-truncate for data cleaning makes a big difference, because each execution needs the previous one to finish, thus requiring sequential processing.
I would like to point out that Pub/Sub doesn't let you configure the rate at which messages are delivered: if 10 messages arrive at the topic, they will all be sent to the subscriber, even if they are processed one at a time. Limiting the function to one instance may lead to overhead for that reason and could increase latency as well. That said, since the expected workload is 15-30 files a day, the above may not be a big concern.
If you'd like parallel executions, you could try creating a new table for each message and setting a short expiration deadline for it with the table.expires setter, so that multiple executions don't conflict with each other. Here is the related library reference. Otherwise, the great answer from Guillaume gets the job done.

Camunda History and Audit Event Log, I can't query any history data

I have been running Camunda with MariaDB, and it's a good solution.
But I have a problem: I read the section of the Camunda User Guide that describes the History and Audit Event Log, and wrote the following code:
List<HistoricProcessInstance> historyList = historyService.createHistoricProcessInstanceQuery()
        .finished()
        .processDefinitionId("Sample1")
        .list();
int historySize = historyList.size();
LOGGER.info("historyList size=" + historySize);
I have finished the Sample1 process, but historySize is still zero. I think I'm missing some configuration. What should I do?
What's the difference between the runtime database and the history database? Do I need to install two databases?
Thank you
I solved the problem by using processDefinitionKey.
I had passed a process definition key to processDefinitionId, so no wonder it couldn't find the finished instances of the process.
I can now get all finished instances of a process using the correct processDefinitionId or processDefinitionKey.
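For reference, the working query by key looks like this (a sketch using the question's historyService; HistoricProcessInstance comes from org.camunda.bpm.engine.history, and a full processDefinitionId has the form "Sample1:1:<id>"):

List<HistoricProcessInstance> finished = historyService.createHistoricProcessInstanceQuery()
        .processDefinitionKey("Sample1")   // the key from the BPMN file, not the generated id
        .finished()
        .list();
LOGGER.info("finished instances=" + finished.size());

On the second question: by default there is no separate history database. The runtime (ACT_RU_*) and history (ACT_HI_*) tables live in the same schema, so one database is enough.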

How can I write data about process assignees to a database

I use Camunda 7.2.0 and I'm not very experienced with it. I'm trying to write data to a database about users who have done something with a process instance (I'm using REST services), in order to build reports later. The problem is that I don't know how to trigger my REST call (which sends information about the current user and assignee to the database) when a user assigns a task to somebody else or claims a task for himself. I see that the Camunda engine sends a request like:
link: engine/engine/default/task/5f965ab7-e74b-11e4-a710-0050568b5c8a/assignee
post: {"userId":"Tom"}
As a partial solution I could create a global variable "currentUser", check on form load whether the user differs from the current one, and if so, call my REST service and update the variable. But this solution doesn't look right to me. Is there a better way to do it? Thanks in advance.
You could use a task listener that updates your data when the assignee of a task changes. If you want this behavior for every task, you can define a global task listener.
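A minimal sketch of such a listener (Camunda 7 delegate API; AuditDao.recordAssignment is a hypothetical stand-in for whatever call writes to your reporting database):

import org.camunda.bpm.engine.delegate.DelegateTask;
import org.camunda.bpm.engine.delegate.TaskListener;

public class AssignmentAuditListener implements TaskListener {

    @Override
    public void notify(DelegateTask task) {
        // Fires on claim and reassignment; getAssignee() already returns the new assignee.
        if (TaskListener.EVENTNAME_ASSIGNMENT.equals(task.getEventName())) {
            AuditDao.recordAssignment(task.getId(), task.getAssignee());
        }
    }
}

You can attach it to a single task with <camunda:taskListener event="assignment" class="...AssignmentAuditListener"/> in the BPMN, or register it engine-wide through a process engine plugin with a custom BpmnParseListener for the global case.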