What causing unreachable errors in Google PubSub? - google-cloud-platform

i'm running application which consists of Google Cloud Functions, triggered by PubSub Topics, so basically they're communicating to each other via Google PubSub.
The problem is, it can struggle sometimes and show delays when publishing messages up to 9s or more. I checked the Metrics Explorer and found out that when high delays it shows next errors:
unreachable_5xx_error_500
unreachable_no_response
internal_rejected_error
unreachable_5xx_error_503
url_4xx_error_429
Here is the chart showing delays:
Code example of publishing message inside Google Cloud Function:
const {PubSub} = require('#google-cloud/pubsub');
const pubSubClient = new PubSub();
async function publishMessage() {
const topicName = 'my-topic';
const dataBuffer = Buffer.from(data);
const messageId = await pubSubClient.topic(topicName).publish(dataBuffer);
console.log(`Message ${messageId} published.`);
}
publishMessage().catch(console.error);
Code example of function triggered by PubSub Topic:
exports.subscribe = async (message) => {
const name = message.data
? Buffer.from(message.data, 'base64').toString()
: 'World';
console.log(`Hello, ${name}!`);
}
And i think this errors causing delays. I didn't find anything on this on the internet, so i hope you can explain what causing this errors and why and probably can help with this.

As it was discussed in the comments, there are some changes and workarounds that can be done to solve or reduce the problem.
At first, as can be found in this guide, PubSub tries to gather multiple messages before delivering it. In other words, it tries to delivery many messages at once. In this specific case to achieve a more realistic real time scenario, should be specified a batch size of 1, which would cause PubSub to delivery every message separately. This batch size can be specified using the maxMessages property in the publisher object creation like in the code below. Besides that, the maxMilliseconds property can be used to specify the maximum latency allowed.
const batchPublisher = pubSubClient.topic(topicName, {
batching: {
maxMessages: maxMessages,
maxMilliseconds: maxWaitTime * 1000,
},
});
In the discussion it was also noticed that the problem is probably related to the Cloud Function's cold-start which makes the latency bigger for this application due to its architecture. The workaround for solving this part of the problem was inserting a Node JS server in the architecture to trigger the functions using PubSub.

Related

Process 10req/s and save to cloud storage - recommended method?

I have 10 requests per second of data I want to save that looks like the entry below. I need to save this data after a CloudRun function completes. (My infrastructure is on google-cloud-platform). The data will be used as a data set for machine learning.
{
"text": "1k characters",
"text2": "1k characters",
"metadata1": "enum (100 vals)",
"metadata2": "number value"
}
I planned to save this as a non-awaited function to google-cloud-storage either in one folder or in folders based on the metadata1 enum. Is either better than the other?
Is this the appropriate route to take?
I think pubsub is overkill as suggested in this SO answer.
I can propose you 2 patterns, but in both case you need to store the messages:
Either use PubSub to stack the messages. Then, use Dataflow to read pubsub and to sink to Cloud Storage. Or use a on demand service (Cloud Run for exemple) to pull your PubSub subscription and write a file with all the message read (You can trigger your Cloud Run with Cloud Scheduler, every hour for example)
Or store the message in BigQuery, and then perform query export to GCS regularly (again with a Cloud Scheduler + Cloud Functions/Run). It's my preferred solution, because, maybe a day, you will have to process differently your message, and to get metrics/perform analytics on them.
#guillaume's answer is definitely the best one, but for ease of implement-ability, I decided to just save them directly to GCS.
const saveData = async ({ text, text2, enum, number }) => {
try {
const timestamp = new Date().getTime()
const folder = enum
const fileName = `${folder}/${enum}-${timestamp}.json`
const file = bucket.file(fileName)
const contents = JSON.stringify({ text, text2, enum, number })
return file.save(contents)
}
} catch (e) {
console.log(`Failed to save file, ${e.message}`)
}
}
It added some latency, but overall I estimated the cost to be about $10 in server costs a month as compared to pubsub method which when trying to determine the cost, put it around $50-100 bucks a month (or more, was hard to determine. But it did assume that each message is 1MB if it's under 1MB).
The big query method Guillaume provided appeared to have no cost since 1TB of transferred data is free each month. I could be wrong on this. I may switch to this later on.

MismatchingMessageCorrelationException : Cannot correlate message ‘onEventReceiver’: No process definition or execution matches the parameters

We are facing an MismatchingMessageCorrelationException for the receive task in some cases (less than 5%)
The call back to notify receive task is done by :
protected void respondToCallWorker(
#NonNull final String correlationId,
final CallWorkerResultKeys result,
#Nullable final Map<String, Object> variables
) {
try {
runtimeService.createMessageCorrelation("callWorkerConsumer")
.processInstanceId(correlationId)
.setVariables(variables)
.setVariable("callStatus", result.toString())
.correlateWithResult();
} catch(Exception e) {
e.printStackTrace();
}
}
When i check the logs : i found that the query executed is this one :
select distinct RES.* from ACT_RU_EXECUTION RES
inner join ACT_RE_PROCDEF P on RES.PROC_DEF_ID_ = P.ID_
WHERE RES.PROC_INST_ID_ = 'b2362197-3bea-11eb-a150-9e4bf0efd6d0' and RES.SUSPENSION_STATE_ = '1'
and exists (select ID_ from ACT_RU_EVENT_SUBSCR EVT
where EVT.EXECUTION_ID_ = RES.ID_ and EVT.EVENT_TYPE_ = 'message'
and EVT.EVENT_NAME_ = 'callWorkerConsumer' )
Some times, When i look for the instance of the process in the database i found it waiting in the receive task
SELECT DISTINCT * FROM ACT_RU_EXECUTION RES
WHERE id_ = 'b2362197-3bea-11eb-a150-9e4bf0efd6d0'
However, when i check the subscription event, it's not yet created in the database
select ID_ from ACT_RU_EVENT_SUBSCR EVT
where EVT.EXECUTION_ID_ = 'b2362197-3bea-11eb-a150-9e4bf0efd6d0'
and EVT.EVENT_TYPE_ = 'message'
and EVT.EVENT_NAME_ = 'callWorkerConsumer'
I think that the solution is to save the "receive task" before getting the response for respondToCallWorker, but sadly i can't figure it out.
I tried "asynch before" callWorker and "Message consumer" but it did not work,
I also tried camunda.bpm.database.jdbc-batch-processing=false and got the same results,
I tried also parallel branches but i get OptimisticLocak exception and MismatchingMessageCorrelationException
Maybe i am doing it wrong
Thanks for your help
This is an interesting problem. As you already found out, the error happens, when you try to correlate the result from the "worker" before the main process ended its transaction, thus there is no message subscription registered at the time you correlate.
This problem in process orchestration is described and analyzed in this blog post, which is definitely worth reading.
Taken from that post, here is a design that should solve the issue:
You make message send and receive parallel and put an async before the send task.
By doing so, the async continuation job for the send event and the message subscription are written in the same transaction, so when the async message send executes, you already have the subscription waiting.
Although this should work and solve the issue on BPMN model level, it might be worth to consider options that do not require remodeling the process.
First, instead of calling the worker directly from your delegate, you could (assuming you are on spring boot) publish a "CallWorkerCommand" (simple pojo) and use a TransactionalEventLister on a spring bean to execute the actual call. By doing so, you first will finish the BPMN process by subscribing to the message and afterwards, spring will execute your worker call.
Second: you could use a retry mechanism like resilience4j around your correlate message call, so in the rare cases where the result comes to quickly, you fail and retry a second later.
Another solution I could think of, since you seem to be using an "external worker" pattern here, is to use an external-task-service task directly, so the send/receive synchronization gets solved by the Camunda external worker API.
So many options to choose from. I would possibly prefer the external task, followed by the transactionalEventListener, but that is a matter of personal preference.

Lambda SQL Server RDS Connection Leak

Problem
I'm using mssql v6.2.0 in a Lambda that is invoked frequently (consistently ~25 concurrent invocations under standard load).
I seem to be having trouble with connection pooling or something because I keep having tons of open DB connections which overwhelm my database (SQL Server on RDS) causing the Lambdas to just time out waiting for query results.
I have read the docs, various similar questions, Github issues, etc. but nothing has worked for this particular issue.
Things I've Learned Already
I did learn that pooling is possible across invocations due to the fact that variables outside the handler function are shared across invocations in the same container. This makes me think I should see just a few connections for each container running my Lambda, but I don't know how many that is so it's hard to verify. Bottom line is that pooling should keep me from having tons and tons of open connections, so something isn't working right.
There are several different ways to use mssql and I have tried several of them. Notably I've tried specifying max pool size with both large and small values but got the same results.
AWS recommends that you check to see if there's already a pool before trying to create a new one. I tried that to no avail. It was something like pool = pool || await createPool()
I know that RDS Proxy exists to help with situations like this, but it appears it isn't offered (at this time) for SQL Server instances.
I do have the ability to slow down my data a bit, but this has a slight impact on the performance of the product as a whole, so I don't want to do that just to avoid solving a DB connections issue.
Left unchecked, I saw as many as 700 connections to the DB at once, leading me to think there's a leak of some kind and it's maybe not just a reasonable result of high usage.
I didn't find a way to shorten the TTL for connections on the SQL Server side as recommended by this re:Invent slide. Perhaps that is part of the answer?
Code
'use strict';
/* Dependencies */
const sql = require('mssql');
const fs = require('fs').promises;
const path = require('path');
const AWS = require('aws-sdk');
const GeoJSON = require('geojson');
AWS.config.update({ region: 'us-east-1' });
var iotdata = new AWS.IotData({ endpoint: process.env['IotEndpoint'] });
/* Export */
exports.handler = async function (event) {
let myVal= event.Records[0].Sns.Message;
// Gather prerequisites in parallel
let [
query1,
query2,
pool
] = await Promise.all([
fs.readFile(path.join(__dirname, 'query1.sql'), 'utf8'),
fs.readFile(path.join(__dirname, 'query2.sql'), 'utf8'),
sql.connect(process.env['connectionString'])
]);
// Query DB for updated data
let results = await pool.request()
.input('MyCol', sql.TYPES.VarChar, myVal)
.query(query1);
// Prepare IoT Core message
let params = {
topic: `${process.env['MyTopic']}/${results.recordset[0].TopicName}`,
payload: convertToGeoJsonString(results.recordset),
qos: 0
};
// Publish results to MQTT topic
try {
await iotdata.publish(params).promise();
console.log(`Successfully published update for ${myVal}`);
//Query 2
await pool.request()
.input('MyCol1', sql.TYPES.Float, results.recordset[0]['Foo'])
.input('MyCol2', sql.TYPES.Float, results.recordset[0]['Bar'])
.input('MyCol3', sql.TYPES.VarChar, results.recordset[0]['Baz'])
.query(query2);
} catch (err) {
console.log(err);
}
};
/**
* Convert query results to GeoJSON for API response
* #param {Array|Object} data - The query results
*/
function convertToGeoJsonString(data) {
let result = GeoJSON.parse(data, { Point: ['Latitude', 'Longitude']});
return JSON.stringify(result);
}
Question
Please help me understand why I'm getting runaway connections and how to fix it. For bonus points: what's the ideal strategy for handling high DB concurrency on Lambda?
Ultimately this service needs to handle several times the current load -- I realize this becomes a quite intense load. I'm open to options like read replicas or other read-performance-boosting measures as long as they're compatible with SQL Server, and they're not just a cop out for writing proper DB access code.
Please let me know if I can improve the question. I know there are similar ones out there but I have read/tried a lot of them and didn't find them to help. Thanks in advance!
Related Material
https://forums.aws.amazon.com/thread.jspa?messageID=678029 (old, but similar)
https://www.slideshare.net/AmazonWebServices/best-practices-for-using-aws-lambda-with-rdsrdbms-solutions-srv320 re:Invent slide deck
https://www.jeremydaly.com/reuse-database-connections-aws-lambda/ Relevant info but for MySQL instead of SQL Server
Answer
I finally found the answer after 4 days of effort. All I needed to do was scale up the DB. The code is actually fine as-is.
I went from db.t2.micro to db.t3.small (or 1 vCPU, 1GB RAM to 2 vCPU and 2GB RAM) at a net cost of roughly $15/mo.
Theory
In my case, the DB probably couldn't handle the processing (which involves several geographic calculations) for all my invocations at once. I did see CPU go up, but I assumed that was a result of the high open connections. When the queries slowed down, the concurrent invocations pile up as Lambdas start to wait for results, finally causing them to time out and not close their connections properly.
Comparisions:
db.t2.micro:
200+ DB connections (goes up continuously if you leave it running)
50+ concurrent invocations
5000+ ms Lambda duration when things slow down, ~300ms under no load
db.t3.small:
25-35 DB connections (constantly)
~5 concurrent invocations
~33 ms Lambda duration <-- ten times faster!
CloudWatch Dashboard
Summary
I think this issue was confusing to me because it didn't smell like a capacity issue. Almost every time I've dealt with high DB connections in the past, it has been a code error. Having tried options there, I thought it was "some magical gotcha of serverless" that I needed to understand. In the end it was as simple as changing DB tiers. My takeaway is that DB capacity issues can manifest themselves in ways other than high CPU and memory usage, and that high connections may be a result of something besides a code bug.
Update (4 months in)
This continues to work very well. I'm impressed that doubling the DB resources seems to have given > 2x performance. Now, when due to load (or a temporary bug during development), the db connections get really high (even over 1k) the DB handles it. I'm not seeing any issues at all with db connections timing out or the database getting bogged down due to load. Since the original time of writing I've added several CPU-intensive queries to support reporting workloads, and it continues to handle all these loads simultaneously.
We've also deployed this setup to production for one customer since the time of writing and it handles that workload without issue.
So a connection pool is no good on Lambda at all what you can do is reuse connections.
Trouble is every Lambda execution opens a pool it'll just flood the DB like you're getting, you want 1 connection per lambda container, you can use a db class like so (this is rough but lemmy know if you've got questions)
export default class MySQL {
constructor() {
this.connection = null
}
async getConnection() {
if (this.connection === null || this.connection.state === 'disconnected') {
return this.createConnection()
}
return this.connection
}
async createConnection() {
this.connection = await mysql.createConnection({
host: process.env.dbHost,
user: process.env.dbUser,
password: process.env.dbPassword,
database: process.env.database,
})
return this.connection
}
async query(sql, params) {
await this.getConnection()
let err
let rows
[err, rows] = await to(this.connection.query(sql, params))
if (err) {
console.log(err)
return false
}
return rows
}
}
function to(promise) {
return promise.then((data) => {
return [null, data]
}).catch(err => [err])
}
What you need to understand is A lambda execution is a little virtual machine that does a task and then stops, it does sit there for a while and if anyone else needs it then it gets reused along with the container and connection for a single task there's never multiple connections to a single lambda.
Hope this helps let me know if ya need any more detail! Oh and welcome to stackoverflow, that's a well-constructed question.

Error: 4 DEADLINE_EXCEEDED: Deadline Exceeded at Object.exports.createStatusError - GCP

I am trying to create a google cloud task from one of my Google Cloud Functions. This function gets triggered when a new object is added to one of my Cloud Storage buckets.
I followed the instructions given here to create my App Engine (App Engine Quickstart Guide)
Then in my Cloud Function, I added the following code to create a cloud task (as described here - Creating App Engine Tasks)
However, there is something wrong with my task or App Engine call (not sure what).
I am getting the following errors every now and then. Sometimes it works and sometimes it does not.
{ Error: 4 DEADLINE_EXCEEDED: Deadline Exceeded at Object.exports.createStatusError (/srv/node_modules/grpc/src/common.js:91:15) at Object.onReceiveStatus (/srv/node_modules/grpc/src/client_interceptors.js:1204:28) at InterceptingListener._callNext (/srv/node_modules/grpc/src/client_interceptors.js:568:42) at InterceptingListener.onReceiveStatus (/srv/node_modules/grpc/src/client_interceptors.js:618:8) at callback (/srv/node_modules/grpc/src/client_interceptors.js:845:24) code: 4, metadata: Metadata { _internal_repr: {} }, details: 'Deadline Exceeded' }
Do let me know if you need more information and I will add them to this question here.
I had the same problem with firestore, trying to write one doc at time; I solve it by returning the total promise. That is because cloud function needs to know when is convenient to terminate the function but if you do not return anything maybe cause this error.
My example:
data.forEach( d => {
reports.doc(_date).collection('data').doc(`${d.Id}`).set(d);
})
This was the problem with me, I was writing document 1 by 1 but I wasn't returning the promise. So I solve it doing this:
const _datarwt = [];
data.forEach( d => {
_datarwt.push( reports.doc(_date).collection('data').doc(`${d.Id}`).set(d) );
})
const _dataloaded = await Promise.all( _datarwt );
I save the returned promise in an array and await for all the promises. That solved it for me. Hope been helpful.

Azure Queue Storage Triggered-Webjob - Dequeue Multiple Messages at a time

I have an Azure Queue Storage-Triggered Webjob. The process my webjob performs is to index the data into Azure Search. Best practice for Azure Search is to index multiple items together instead of one at a time, for performance reasons (indexing can take some time to complete).
For this reason, I would like for my webjob to dequeue multiple messages together so I can loop through, process them, and then index them all together into Azure Search.
However I can't figure out how to get my webjob to dequeue more than one at a time. How can this be accomplished?
For this reason, I would like for my webjob to dequeue multiple messages together so I can loop through, process them, and then index them all together into Azure Search.
According to your description, I suggest you could try to use Microsoft.Azure.WebJobs.Extensions.GroupQueueTrigger to achieve your requirement.
This extension will enable you to trigger functions and receive the group of messages instead of a single message like with [QueueTrigger].
More details, you could refer to below code sample and article.
Install:
Install-Package Microsoft.Azure.WebJobs.Extensions.GroupQueueTrigger
Program.cs:
static void Main()
{
var config = new JobHostConfiguration
{
StorageConnectionString = "...",
DashboardConnectionString = "...."
};
config.UseGroupQueueTriggers();
var host = new JobHost(config);
host.RunAndBlock();
}
Function.cs:
//Receive 10 messages at one time
public static void MyFunction([GroupQueueTrigger("queue3", 10)]List<string> messages)
{
foreach (var item in messages)
{
Console.WriteLine(item);
}
}
Result:
How would I get that changed to a GroupQueueTrigger? Is it an easy change?
In my opinion, it is an easy change.
You could follow below steps:
1.Install the package Microsoft.Azure.WebJobs.Extensions.GroupQueueTrigger from Nuget Package manager.
2.Change the program.cs file enable UseGroupQueueTriggers.
3.Change the webjobs functions according to your old triggered function.
Note:The group queue message trigger must use list.
As my code sample shows:
public static void MyFunction([GroupQueueTrigger("queue3", 10)]List<string> messages)
This function will get 10 messages from the "queue3" one time, so in this function you could change the function loop the list of the messages and process them, then index them all together into Azure Search.
4.Publish your webjobs to azure web apps.