Checkpointing Performance Issue on azure event hubs using kafka.js - azure-eventhub

We are using a kafka.js event client with Azure Event Hubs and are seeing performance issues. The event client originally ran on the Azure SDK event client for Event Hubs, and we recently moved it to kafka.js.
The issue is that checkpointing (committing the offset) for each event now takes much longer.
The Azure SDK client took about 15 ms to checkpoint, using an external Azure storage account outside of Event Hubs. The new kafka.js client relies on Event Hubs' internal storage for checkpointing and takes more than 1 second per checkpoint.
We use eachMessage and checkpoint after each message to ensure data consistency, but Azure Event Hubs throttles checkpointing at 4 calls/second per partition, and each checkpoint call takes close to 1 second because of the shared internal checkpoint storage used by Event Hubs.
We tried eachBatch with autoCommit, but it appears to commit the entire batch of messages even when processing fails partway through the batch.
Has anyone noticed a similar issue elsewhere? We would appreciate any feedback or suggestions for solving this.
Here is the code snippet:
/**
* @method consume()
*/
public async consume() {
await this.client.run({
eachBatchAutoResolve: this.configuration.eachBatchAutoResolve, // default to false so consumer does not auto-resolve the offsets
autoCommitThreshold: this.configuration.autoCommitThreshold,
partitionsConsumedConcurrently: this.configuration.partitionsConsumedConcurrently,
eachBatch: async ({ batch, resolveOffset, heartbeat }) => {
if (this.shutdownContext != null) {
// prevent processing new events during a shutdown.
return;
}
const { topic, partition } = batch;
const partitionId = `${topic}-${partition}`;
if (this.partitionTracker.isPartitionProcessing(partitionId)) {
// prevent concurrent event processing on the same partition.
return;
}
this.partitionTracker.startProcessing(partitionId);
this.genericLogger.debug(
`batch last offset: ${batch.lastOffset()} partitionId: ${partitionId} highWatermark: ${
batch.highWatermark
} offsetLag: ${batch.offsetLag()}`,
);
/* eslint-disable no-await-in-loop, no-restricted-syntax */
for (const message of batch.messages) {
let event: Event<any>;
const partitionKey = message.key ? message.key.toString() : message.key;
const { offset, timestamp } = message;
// build event context
const context: EventContext = {
topic,
partition,
partitionKey,
offset,
timestamp,
};
try {
// transform kafka message to event
event = await this.transform(message);
} catch (error) {
const id = uuid.v4();
this.genericLogger.crit(
`Event failed consumer transformation. Please check the Dead Letter topic for a version of the event:${id}`,
);
const poisonousEvent: Event<any> = {
id,
correlationId: { id, origin: uuid.v4() },
data: message.value,
type: 'ca.event.poisonousmessage.created.v1',
};
await this.dlqProducer.send(
poisonousEvent,
this.configuration.dlqTopic,
this.genericLogger,
);
try {
resolveOffset(message.offset);
await heartbeat();
// await commitOffsetsIfNecessary();
this.partitionTracker.stopProcessing(partitionId);
await this.shutdownCheck(context);
} catch (commitError) {
await this.signalShutdown(
ProcessExceptionType.FAILURE_TO_PROCESS_AND_COMMIT,
commitError,
);
await this.shutdown({
reason: ProcessExceptionType.FAILURE_TO_PROCESS_AND_COMMIT,
error: commitError,
eventContext: context,
});
}
break;
}
const contextLogger = getLoggerFactory(
this.configuration.loggerOptions,
'event-client-consumer',
)(getEventMetaData(event));
const eventMessageContext = `[eventHubName:${topic} partition:${partition} offset:${offset} partitionKey:${partitionKey} timestampUTC:${timestamp}]`;
contextLogger.debug(`received new message on ${eventMessageContext}.`);
try {
// heartbeat
await heartbeat();
} catch (error) {
error.message = `error while manually sending a heartbeat: ${error.message}`;
if (error.message.includes(REBALANCING)) {
contextLogger.warn(error.message);
} else {
contextLogger.error(error.message);
}
break;
}
try {
// process
await this.process(event, context, contextLogger, heartbeat);
resolveOffset(message.offset);
// commit
// await this.commit(
// event,
// context,
// contextLogger,
// message.offset,
// resolveOffset,
// commitOffsetsIfNecessary,
// );
} catch (error) {
if (error.message.includes(REBALANCING)) {
contextLogger.warn(
`rebalancing error during processing or committing: ${error.message}`,
);
throw error;
}
const exception = error as Exception;
switch (exception.type) {
case ProcessExceptionType.COMMIT: {
contextLogger.error(
`[Event Processing Error] failed to commit processed event on ${eventMessageContext}. error: ${error.message}.`,
error,
);
// failed to commit message after retry. sleeping.
await this.sleep(context, error as Error, exception.type, heartbeat);
// await this.signalShutdown(exception.type, error);
// await this.shutdown({ reason: exception.type, error, eventContext: context });
throw error;
}
case ProcessExceptionType.SHUTDOWN_NO_COMMIT: {
// shutdown with no commit. most likely due to non retryable circuit breaker exception.
await this.sleep(context, error as Error, exception.type, heartbeat);
// await this.signalShutdown(exception.type, error);
// await this.shutdown({ reason: exception.type, error, eventContext: context });
throw error;
}
case ProcessExceptionType.SLEEP_NO_COMMIT: {
// return early for sleep cycle.
throw error;
}
default: {
if (this.configuration.continueOnFailedEventProcessing) {
contextLogger.crit(
`[Event Processing Error] Error in user defined code when processing event on ${eventMessageContext} beyond process loop retry limit error: ${error.message}. Committing event to continue processing new events.`,
error,
);
try {
resolveOffset(message.offset);
await heartbeat();
// await commitOffsetsIfNecessary();
this.partitionTracker.stopProcessing(partitionId);
await this.shutdownCheck(context);
} catch (commitError) {
await this.sleep(context, commitError as Error, exception.type, heartbeat);
// await this.signalShutdown(
// ProcessExceptionType.FAILURE_TO_PROCESS_AND_COMMIT,
// commitError,
// );
// await this.shutdown({
// reason: ProcessExceptionType.FAILURE_TO_PROCESS_AND_COMMIT,
// error: commitError,
// eventContext: context,
// });
}
} else {
contextLogger.crit(
`[Event Processing Error] Error in user defined code when processing event on ${eventMessageContext} beyond process loop retry limit error: ${error.message}. Initiating sleep.`,
error,
);
await this.sleep(context, error as Error, exception.type, heartbeat);
// await this.signalShutdown(
// ProcessExceptionType.FAILURE_TO_PROCESS_AND_COMMIT,
// error,
// );
// await this.shutdown({
// reason: ProcessExceptionType.FAILURE_TO_PROCESS_AND_COMMIT,
// error,
// eventContext: context,
// });
}
}
}
throw error;
}
}
this.genericLogger.debug(
`batch last offset: ${batch.lastOffset()} partitionId: ${partitionId} highWatermark: ${
batch.highWatermark
} offsetLag: ${batch.offsetLag()}`,
);
this.genericLogger.debug(
`resolved batch of ${batch.messages.length} messages partitionId: ${partitionId}`,
);
this.partitionTracker.stopProcessing(partitionId);
await this.shutdownCheck();
},
});
}
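One way to reduce the number of checkpoint calls without committing unprocessed messages is to resolve each offset as it is processed but commit only every N messages or every T milliseconds. kafka.js exposes autoCommitThreshold and autoCommitInterval on consumer.run for this purpose; the sketch below hand-rolls the same idea so the trade-off is visible. CommitThrottle, commitFn, and the limits are hypothetical names for illustration and are not part of kafka.js or of the code above.

```typescript
// Hypothetical helper: remembers the last processed offset and commits it
// only after `maxPending` messages or `maxDelayMs` milliseconds.
type CommitFn = (offset: string) => Promise<void>;

class CommitThrottle {
  private pending = 0;
  private lastOffset: string | null = null;
  private lastCommitAt: number;

  constructor(
    private readonly commitFn: CommitFn, // assumed to wrap the real commit call
    private readonly maxPending: number,
    private readonly maxDelayMs: number,
    private readonly now: () => number = Date.now,
  ) {
    this.lastCommitAt = this.now();
  }

  // Record a successfully processed offset; commit if a limit is reached.
  // Returns true when a commit was issued.
  async markProcessed(offset: string): Promise<boolean> {
    this.lastOffset = offset;
    this.pending += 1;
    const due =
      this.pending >= this.maxPending ||
      this.now() - this.lastCommitAt >= this.maxDelayMs;
    if (!due) return false;
    await this.flush();
    return true;
  }

  // Commit whatever is pending, e.g. at the end of a batch or on shutdown.
  async flush(): Promise<void> {
    if (this.pending === 0 || this.lastOffset === null) return;
    await this.commitFn(this.lastOffset);
    this.pending = 0;
    this.lastCommitAt = this.now();
  }
}
```

With limits such as 50 messages or 1000 ms, the commit rate stays well under a 4 calls/second throttle. The cost is at-least-once delivery: after a crash, up to maxPending already processed messages may be redelivered, so processing would need to be idempotent.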

Related

Piping an API Gateway response to the client through a Lambda handler

I have a REST API using AWS API Gateway. The API is handled by a custom Lambda function. I have a /prompts endpoint in my API, for which the Lambda function calls the OpenAI API, sends it the prompt, and should stream the result to the user as it is being generated (which can take a few seconds).
I'm able to stream and handle the response from the OpenAI API in my Lambda function.
I would now like to re-stream / pipe that response to the client.
My question is: how do I do that?
Is there a way to simply pipe the stream being received from the OpenAI API to my client?
My Lambda function is:
try {
const res = await openai.createCompletion({
...params,
stream: true,
}, { responseType: 'stream' });
res.data.on('data', data => {
const lines = data.toString().split('\n').filter(line => line.trim() !== '');
for (const line of lines) {
const message = line.replace(/^data: /, '');
if (message === '[DONE]') {
// store the response to DynamoDB
storeRecord(content)
return content
}
try {
const parsed = JSON.parse(message);
content += parsed.choices[0].text
// ****** I want to send content to the front-end client... *******
} catch(error) {
console.error('Could not JSON parse stream message', message, error);
}
}
});
} catch (error) {
if (error.response?.status) {
console.error(error.response.status, error.message);
error.response.data.on('data', data => {
const message = data.toString();
try {
const parsed = JSON.parse(message);
console.error('An error occurred during OpenAI request: ', parsed);
} catch(error) {
console.error('An error occurred during OpenAI request: ', message);
}
});
} else {
console.error('An error occurred during OpenAI request', error);
}
}
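How the text reaches the client depends on the transport, which the question leaves open. Whatever that turns out to be, it may help to separate parsing from forwarding so the streaming logic can be tested on its own. The sketch below reuses the line-splitting approach from the handler above; parseCompletionChunk, handleChunk, and the forward callback are hypothetical names, and forward stands in for whichever mechanism ends up delivering text to the front end.

```typescript
// Result of parsing one raw chunk from the completion stream.
interface ParsedChunk {
  deltas: string[]; // text fragments found in this chunk, in order
  done: boolean;    // true once the "[DONE]" sentinel was seen
}

// Splits a raw chunk into "data: ..." lines and extracts choices[0].text.
// Lines that are not valid JSON are skipped here (the handler above logs them).
function parseCompletionChunk(raw: string): ParsedChunk {
  const result: ParsedChunk = { deltas: [], done: false };
  const lines = raw.split('\n').filter((line) => line.trim() !== '');
  for (const line of lines) {
    const message = line.replace(/^data: /, '');
    if (message === '[DONE]') {
      result.done = true;
      break;
    }
    try {
      const parsed = JSON.parse(message);
      const text = parsed?.choices?.[0]?.text;
      if (typeof text === 'string') result.deltas.push(text);
    } catch {
      // ignore lines that are not complete JSON
    }
  }
  return result;
}

// Hypothetical forwarding hook: passes each fragment to `forward` and
// reports whether the stream has finished.
function handleChunk(raw: string, forward: (text: string) => void): boolean {
  const { deltas, done } = parseCompletionChunk(raw);
  deltas.forEach((text) => forward(text));
  return done;
}
```

In the handler above, handleChunk(data.toString(), forward) would replace the body of the 'data' listener, with the DynamoDB write moved to the point where it returns true.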

Why does sending different messages to AWS SQS return the same message ID?

I want to send messages to a FIFO SQS queue. Given an array of different user IDs, for each ID I want to call the sendMessage command to send the ID as the message body. I expect each call to return a different message ID, but they all return the same MessageId. Sample code below:
const sendMessage = async (params:ISqsRequestParam) => {
try {
const sqsResponse = await sqsClient.send(new SendMessageCommand(params));
console.log(`send message response: ${JSON.stringify(sqsResponse)}`);
return sqsResponse.MessageId; // For unit tests.
} catch (err) {
console.error('SQS sending', err);
}
};
export const handler = async function (event: IEventBridgeAddtionalParams, context: Context): Promise<string[]> {
console.info(`${context.functionName} triggered at ${event.time} under ${process.env.NODE_ENV} mode`);
console.info(`custom parameter value is ${event.custom_parameter}`);
try {
const sqsUrl: string = event.custom_parameter === 'Creator' ? process.env.BATCH_CREATOR_QUEUE_URL : process.env.BATCH_PROCESSOR_QUEUE_URL;
console.info(`SQS Url is: ${sqsUrl}`);
const tenantData: ITenantResponse = await fetchAllTenantIds();
console.info(`ResponseData from tenant service: ${JSON.stringify(tenantData.value)}`);
// change to fix tenantId for development environment for better debugging and test
const data : ITenantDetails[] = process.env.NODE_ENV !== 'production' ? fixedTenantDataForNonProd() : tenantData.value;
const promise = data.map(async tenantDetails => {
if (!tenantDetails.tenantFailed) {
console.info(`Tenant Id in message body: ${tenantDetails.id}`);
const params: ISqsRequestParam = {
MessageBody: tenantDetails.id,
MessageDeduplicationId: `FP_Tenant_populator_${event.custom_parameter}`, // Required for FIFO queues
MessageGroupId: `FP_Tenant_populator_Group_${event.custom_parameter}`, // Required for FIFO queues
QueueUrl: sqsUrl //SQS_QUEUE_URL; e.g., 'https://sqs.REGION.amazonaws.com/ACCOUNT-ID/QUEUE-NAME'
};
const messageId:string = await sendMessage(params);
return messageId; //for unit testing
} else {
return null; //for unit testing
}
});
const messageIds:string[] = await Promise.all(promise); //for unit testing
const activeMessageIds:string[] = messageIds.filter(id=> id!= null); //for unit testing
console.info(`Success, ${activeMessageIds.length} messages sent. MessageID:${activeMessageIds[0]}`);
return activeMessageIds; //for unit testing
} catch (error) {
console.error(`Fetch tenant details error: ${error}`);
}
};
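In the code above, every message is sent with the same MessageDeduplicationId. For FIFO queues, SQS treats messages that share a deduplication ID within the 5-minute deduplication interval as duplicates: the later sends are accepted but not delivered again, which would explain the repeated message ID. A minimal sketch of building a deduplication ID per tenant follows. buildFifoParams, FifoParams, and runId are hypothetical names, and whether tenant ID plus a run identifier is the right uniqueness key depends on how often the job runs.

```typescript
// Shape of the parameters passed to SendMessageCommand for a FIFO queue.
interface FifoParams {
  MessageBody: string;
  MessageDeduplicationId: string;
  MessageGroupId: string;
  QueueUrl: string;
}

// Hypothetical helper: the deduplication ID includes the tenant ID and a
// per-run identifier (for example event.time), so different tenants never
// collide while a retry of the same run is still deduplicated.
function buildFifoParams(
  tenantId: string,
  customParameter: string,
  queueUrl: string,
  runId: string,
): FifoParams {
  return {
    MessageBody: tenantId,
    MessageDeduplicationId: `FP_Tenant_populator_${customParameter}_${tenantId}_${runId}`,
    MessageGroupId: `FP_Tenant_populator_Group_${customParameter}`,
    QueueUrl: queueUrl,
  };
}
```

The group ID is left unchanged, so ordering within the group is preserved; only the deduplication key varies per message.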

'Transaction cannot be rolled back...' using unmanaged transactions

We use unmanaged transactions in many places, but this issue only occurs in 2 functions that are invoked more frequently, and only in the production environment (we haven't been able to reproduce it on dev).
Our code looks similar to this:
const t = await database.t();
try {
await function1(..., {t});
await function2(..., {t});
}
catch (e) {
await t.rollback(); // <- this throws the error
throw e
};
// more logic
await t.commit();
Here, database is just a Sequelize instance created with the database name, username and password.
We assumed it was a connection error, based on https://github.com/sequelize/sequelize/issues/4850, but in that case the code would not get past the first line of our pseudocode.
Error output on CloudWatch:
If you have multiple try-catch blocks, you may need to check the transaction state before calling rollback or commit:
const t = await database.t();
try {
await function1(..., { t });
await function2(..., { t });
}
catch (e) {
if (!t.finished) {
await t.rollback();
}
throw e;
};
// more logic
await t.commit();
Reference links:
https://github.com/sequelize/sequelize/issues/6547#issuecomment-466016971
https://github.com/sequelize/sequelize/pull/5043/files#diff-6c5ddfc7d68c447e32ef4c38b7ed69628910355ea0aff735bd4bcecc1256a8d8
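The same guard can be wrapped in a small helper so that a failing rollback never hides the error that triggered it. This is only a sketch: TxLike is a hypothetical stand-in that mirrors the finished, commit, and rollback members used above, withTransaction is an illustrative name, and commit is moved inside the helper, which differs from the layout in the question.

```typescript
// Minimal shape of the transaction object used in this sketch. `finished`
// is unset while the transaction is open and set once it has completed.
interface TxLike {
  finished?: string;
  commit(): Promise<void>;
  rollback(): Promise<void>;
}

// Runs `work`, commits on success, and rolls back on failure only when the
// transaction is still open. The original error is always rethrown.
async function withTransaction<T>(
  tx: TxLike,
  work: (tx: TxLike) => Promise<T>,
): Promise<T> {
  try {
    const result = await work(tx);
    await tx.commit();
    return result;
  } catch (err) {
    if (!tx.finished) {
      try {
        await tx.rollback();
      } catch (rollbackErr) {
        // Log and continue so the original failure is not masked.
        console.error('rollback failed', rollbackErr);
      }
    }
    throw err;
  }
}
```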

Self invoking lambda invocation timing out

We're trying to develop a self-invoking lambda to process S3 files in chunks. The lambda role has the policies needed for the invocation attached.
Here's the code for the self-invoking lambda:
export const processFileHandler: Handler = async (
event: S3CreateEvent,
context: Context,
callback: Callback,
) => {
let bucket = loGet(event, 'Records[0].s3.bucket.name');
let key = loGet(event, 'Records[0].s3.object.key');
let totalFileSize = loGet(event, 'Records[0].s3.object.size');
const lastPosition = loGet(event, 'position', 0);
const nextRange = getNextSizeRange(lastPosition, totalFileSize);
context.callbackWaitsForEmptyEventLoop = false;
let data = await loadDataFromS3ByRange(bucket, key, nextRange);
await database.connect();
log.debug(`Successfully connected to the database`);
const docs = await getParsedDocs(data, lastPosition);
log.debug(`upserting ${docs.length} records to database`);
if (docs.length) {
try {
// upserting logic
log.debug(`total documents added: ${docs.length}`);
} catch (err) {
await recurse(nextRange.end, event, context);
log.debug(`error inserting docs: ${JSON.stringify(err)}`);
}
}
if (nextRange.end < totalFileSize) {
log.debug(`Last ${context.getRemainingTimeInMillis()} milliseconds left`);
if (context.getRemainingTimeInMillis() < 10 * 10 * 10 * 6) {
log.debug(`Less than 6000 milliseconds left`);
log.debug(`Invoking next iteration`);
await recurse(nextRange.end, event, context);
callback(null, {
message: `Lambda timed out processing file, please continue from LAST_POSITION: ${nextRange.start}`,
});
}
} else {
callback(null, { message: `Successfully completed the chunk processing task` });
}
};
Here, recurse is an invocation call to the same Lambda. Everything else works as expected; the function only times out when execution reaches this invocation request:
const recurse = async (position: number, event: S3CreateEvent, context: Context) => {
let newEvent = Object.assign(event, { position });
let request = {
FunctionName: context.invokedFunctionArn,
InvocationType: 'Event',
Payload: JSON.stringify(newEvent),
};
let resp = await lambda.invoke(request).promise();
console.log('Invocation complete', resp);
return resp;
};
This is the stack trace logged to CloudWatch:
{
"errorMessage": "connect ETIMEDOUT 63.32.72.196:443",
"errorType": "NetworkingError",
"stackTrace": [
"Object._errnoException (util.js:1022:11)",
"_exceptionWithHostPort (util.js:1044:20)",
"TCPConnectWrap.afterConnect [as oncomplete] (net.js:1198:14)"
]
}
Creating a self-invoking Lambda function is not a good idea. In case of an error (which could also be a bad handler call on the AWS side), the Lambda function might re-run several times, which is very hard to monitor and debug.
I would suggest using Step Functions. I believe the tutorial "Iterating a Loop Using Lambda" can help.
Off the top of my head, if you prefer not to deal with Step Functions, you could create a Lambda trigger for an SQS queue and then send a message to the queue whenever you want the Lambda function to run again.
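The queue-based variant could look roughly like the sketch below: each invocation handles one chunk and, if the file is not finished, enqueues a message describing where the next invocation should start. All names here (ChunkMessage, handleChunkMessage, processChunk, enqueue, the chunk size) are hypothetical; enqueue is assumed to wrap the actual SQS send call and processChunk the S3 range read and upsert.

```typescript
// Describes one unit of work placed on the queue.
interface ChunkMessage {
  bucket: string;
  key: string;
  position: number;  // byte offset where this chunk starts
  totalSize: number; // total size of the S3 object in bytes
}

type ProcessChunk = (message: ChunkMessage, end: number) => Promise<void>;
type Enqueue = (message: ChunkMessage) => Promise<void>;

const DEFAULT_CHUNK_SIZE = 1024 * 1024; // assumed value, tune as needed

// Processes one chunk and enqueues the next one if any bytes remain.
// Returns true when a follow-up message was enqueued.
async function handleChunkMessage(
  message: ChunkMessage,
  processChunk: ProcessChunk,
  enqueue: Enqueue,
  chunkSize: number = DEFAULT_CHUNK_SIZE,
): Promise<boolean> {
  const end = Math.min(message.position + chunkSize, message.totalSize);
  await processChunk(message, end);
  if (end < message.totalSize) {
    await enqueue({ ...message, position: end });
    return true;
  }
  return false;
}
```

Because the continuation is a queue message rather than a direct invoke call, a failed chunk follows the queue's retry policy instead of an ad hoc recursive call.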

AWS javascript SDK request.js send request function execution time gradually increases

I am using aws-sdk to push data to a Kinesis stream.
I am using PutRecord to push data in real time.
I observe the same delay with PutRecords when writing in batches.
I tried this with 4 records, so I am not crossing any shard limit.
Below is my Node.js HTTP agent configuration. The default maxSockets value is set to Infinity.
Agent {
domain: null,
_events: { free: [Function] },
_eventsCount: 1,
_maxListeners: undefined,
defaultPort: 80,
protocol: 'http:',
options: { path: null },
requests: {},
sockets: {},
freeSockets: {},
keepAliveMsecs: 1000,
keepAlive: false,
maxSockets: Infinity,
maxFreeSockets: 256 }
Below is the code I am using to trigger the putRecord calls:
event.Records.forEach(function(record) {
// Buffer.from replaces the deprecated new Buffer() constructor
var payload = Buffer.from(record.kinesis.data, 'base64').toString('ascii');
// put record request
evt = transformEvent(payload);
promises.push(writeRecordToKinesis(kinesis, streamName, evt));
});
Event structure is
evt = {
Data: Buffer.from(JSON.stringify(payload)),
PartitionKey: payload.PartitionKey,
StreamName: streamName,
SequenceNumberForOrdering: dateInMillis.toString()
};
This event is used in put request.
function writeRecordToKinesis(kinesis, streamName, evt ) {
console.time('WRITE_TO_KINESIS_EXECUTION_TIME');
var deferred = Q.defer();
try {
kinesis.putRecord(evt , function(err, data) {
if (err) {
console.warn('Kinesis putRecord %j', err);
deferred.reject(err);
} else {
console.log(data);
deferred.resolve(data);
}
console.timeEnd('WRITE_TO_KINESIS_EXECUTION_TIME');
});
} catch (e) {
console.error('Error occurred while writing data to Kinesis: ' + e);
deferred.reject(e);
}
return deferred.promise;
}
Below is output for 3 messages.
WRITE_TO_KINESIS_EXECUTION_TIME: 2026ms
WRITE_TO_KINESIS_EXECUTION_TIME: 2971ms
WRITE_TO_KINESIS_EXECUTION_TIME: 3458ms
Here we can see gradual increase in response time and function execution time.
I added timers to the aws-sdk request.js class and can see the same pattern there as well.
Below is the code snippet from the aws-sdk request.js class that executes the put request.
send: function send(callback) {
console.time('SEND_REQUEST_TO_KINESIS_EXECUTION_TIME');
if (callback) {
this.on('complete', function (resp) {
console.timeEnd('SEND_REQUEST_TO_KINESIS_EXECUTION_TIME');
callback.call(resp, resp.error, resp.data);
});
}
this.runTo();
return this.response;
},
Output for send request:
SEND_REQUEST_TO_KINESIS_EXECUTION_TIME: 1751ms
SEND_REQUEST_TO_KINESIS_EXECUTION_TIME: 1816ms
SEND_REQUEST_TO_KINESIS_EXECUTION_TIME: 2761ms
SEND_REQUEST_TO_KINESIS_EXECUTION_TIME: 3248ms
Here you can see that it increases gradually.
Can anyone please suggest how I can reduce this delay?
Taking 3 seconds to push a single record to Kinesis is not acceptable.
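One thing to note about the measurements: all putRecord calls in the loop are started at nearly the same time and share a single console.time label, so the printed numbers may not correspond to individual requests. A sketch of timing each call with its own start timestamp follows; timeCall and Timed are hypothetical names, and fn stands in for a promise-returning wrapper around the putRecord call.

```typescript
// Result of one timed call.
interface Timed<T> {
  label: string;
  ms: number; // elapsed wall-clock time for this call only
  value: T;
}

// Times a single async call using its own start timestamp, so concurrent
// calls do not interfere with each other's measurements.
async function timeCall<T>(
  label: string,
  fn: () => Promise<T>,
  now: () => number = Date.now,
): Promise<Timed<T>> {
  const start = now();
  const value = await fn();
  return { label, ms: now() - start, value };
}
```

Giving each call a distinct label (for example the partition key) and comparing concurrent runs against strictly sequential ones would show whether the growth comes from the requests themselves or from how they are queued.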