Multipart file upload sometimes failing in a Docker container on AWS - amazon-web-services

I have a hapi server running in a Docker container on AWS, and it exposes a file upload API. This API runs smoothly on my local machine, but when deployed to AWS it sometimes fails with the error "Incomplete multipart payload". The error is intermittent; it does not occur on every request.
The images I am uploading are small (less than 100 KB), and the failure is not caused by a slow network, as I have tested on multiple networks.
After debugging hapi's payload-parsing modules, I found that the Pez module, which parses the payload, is throwing this error. I also noticed that when this error happens, Pez's onClose event fires and none of the parse events occur, so it returns the "Incomplete multipart payload" error. Pez's state is "preamble" when this happens; in the successful case, the state is "epilogue".
My hapi route config is:
config: {
  payload: {
    maxBytes: 20971520,
    output: 'data',
    parse: true,
    allow: 'multipart/form-data'
  }
}
Can somebody suggest why parsing fails at times, or why the onClose event is called before parsing happens?
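For what it's worth, "Incomplete multipart payload" means the stream closed before Pez saw the closing boundary marker. One way to confirm that theory is to capture the raw payload on a failing request and check for the terminating boundary yourself. A minimal sketch (the boundary value comes from the request's Content-Type header; the example boundary is made up):

```javascript
// Minimal sketch: a multipart body is only complete when it contains the
// closing boundary marker "--<boundary>--". If this returns false for a
// failing request, the body was truncated before it reached the server.
function hasClosingBoundary(rawBody, boundary) {
  const closing = `--${boundary}--`;
  // rawBody may be a Buffer (hapi hands you one with output: 'data')
  const text = Buffer.isBuffer(rawBody) ? rawBody.toString('latin1') : String(rawBody);
  return text.includes(closing);
}

// Example with a hypothetical boundary value:
const complete = '--XYZ\r\nContent-Disposition: form-data; name="f"\r\n\r\nhi\r\n--XYZ--\r\n';
console.log(hasClosingBoundary(complete, 'XYZ'));       // → true
console.log(hasClosingBoundary('--XYZ\r\nhi', 'XYZ'));  // → false
```

If truncated bodies show up only behind the AWS load balancer, that points at a proxy or idle-timeout issue in front of the container rather than at hapi itself.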

Related

Errors connecting to AWS Keyspaces using a lambda layer

Intermittently getting the following error when connecting to AWS Keyspaces using a Lambda layer:
All host(s) tried for query failed. First host tried, 3.248.244.53:9142: Host considered as DOWN. See innerErrors.
I am trying to query a table in a keyspace from a Node.js Lambda function as follows:
import cassandra from 'cassandra-driver';
import fs from 'fs';

export default class AmazonKeyspace {
  tpmsClient = null;

  constructor () {
    let auth = new cassandra.auth.PlainTextAuthProvider('cass-user-at-xxxxxxxxxx', 'zzzzzzzzz');
    let sslOptions1 = {
      ca: [fs.readFileSync('/opt/utils/AmazonRootCA1.pem', 'utf-8')],
      host: 'cassandra.eu-west-1.amazonaws.com',
      rejectUnauthorized: true
    };
    this.tpmsClient = new cassandra.Client({
      contactPoints: ['cassandra.eu-west-1.amazonaws.com'],
      localDataCenter: 'eu-west-1',
      authProvider: auth,
      sslOptions: sslOptions1,
      keyspace: 'tpms',
      protocolOptions: { port: 9142 }
    });
  }

  getOrganisation = async (orgKey) => {
    const SQL = 'SELECT * FROM organisation WHERE organisation_id = ?;';
    return new Promise((resolve, reject) => {
      this.tpmsClient.execute(SQL, [orgKey], { prepare: true }, (err, result) => {
        if (!err?.message) resolve(result.rows);
        else reject(err.message);
      });
    });
  };
}
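As an aside, cassandra-driver's execute() also returns a Promise when no callback is passed, so the manual Promise wrapper in getOrganisation can be dropped. A sketch of the simplified shape, with the client injected so it runs standalone (in the real code the client would be the cassandra.Client built above):

```javascript
// Sketch: execute() without a callback returns a Promise, so the repository
// method reduces to a single await. The client is injected here so the
// sketch runs without the driver installed.
class OrgRepo {
  constructor(client) {
    this.client = client;
  }

  async getOrganisation(orgKey) {
    const result = await this.client.execute(
      'SELECT * FROM organisation WHERE organisation_id = ?;',
      [orgKey],
      { prepare: true }
    );
    return result.rows;
  }
}

// Usage with a stub standing in for cassandra.Client:
const stubClient = { execute: async () => ({ rows: [{ organisation_id: 'abc' }] }) };
new OrgRepo(stubClient).getOrganisation('abc').then(rows => console.log(rows.length)); // → 1
```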
I am basically following this recommended AWS documentation.
https://docs.aws.amazon.com/keyspaces/latest/devguide/using_nodejs_driver.html
It seems that around 10-20% of the time the lambda function (cassandra driver) cannot connect to the endpoint.
I am pretty familiar with Cassandra (I already use a 6 node cluster that I manage) and don't have any issues with that.
Could this be a timeout or do I need more contact points?
I followed the recommended guides and checked the AWS console for errors, but none were shown.
UPDATE:
I am occasionally getting the error below (roughly 1 call in 50 when I invoke the function with 5 concurrent calls):
"All host(s) tried for query failed. First host tried,
3.248.244.5:9142: DriverError: Socket was closed
    at Connection.clearAndInvokePending (/opt/node_modules/cassandra-driver/lib/connection.js:265:15)
    at Connection.close (/opt/node_modules/cassandra-driver/lib/connection.js:618:8)
    at TLSSocket.<anonymous> (/opt/node_modules/cassandra-driver/lib/connection.js:93:10)
    at TLSSocket.emit (node:events:525:35)
    at node:net:313:12
    at TCP.done (node:_tls_wrap:587:7)
{ info: 'Cassandra Driver Error', isSocketError: true, coordinator: '3.248.244.5:9142' }
This exception may be caused by throttling on the Keyspaces side, resulting in the DriverError you are seeing sporadically.
I would suggest taking a look at this repo, which should help you put measures in place to either prevent this issue or at least reveal the true cause of the exception.
For some of the errors you see in the logs, you will need to investigate Amazon CloudWatch metrics to see whether you have throttling or system errors. I've built this AWS CloudFormation template to deploy a CloudWatch dashboard with all the appropriate metrics, which will give your application better observability.
A System Error indicates an event that must be resolved by AWS and is often part of normal operations; activities such as timeouts, server faults, or scaling can result in server errors. A User Error indicates an event that can usually be resolved by the user, such as an invalid query or exceeding a capacity quota. Amazon Keyspaces passes a System Error back as a Cassandra ServerError. In most cases this is a transient error, and you can retry your request until it succeeds. With the Cassandra driver's default retry policy, customers can also see NoHostAvailableException or AllNodesFailedException, or messages like yours: "All host(s) tried for query failed". This is a client-side exception thrown once every host in the load balancing policy's query plan has attempted the request.
Take a look at this retry policy for Node.js, which should help resolve your "All host(s) tried for query failed" exception or pass back the original exception.
The retry policies in the Cassandra drivers are pretty crude and cannot do more sophisticated things like circuit breaker patterns. You may eventually want to use a "fail fast" retry policy for the driver and handle the exceptions in your application code.
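To make the "fail fast in the driver, retry in the application" suggestion concrete, here is a minimal sketch of an application-level retry with exponential backoff. The query function is injected so the sketch runs standalone; in the real code it would wrap tpmsClient.execute, and the error-matching pattern is an assumption you would tune to the transient failures you actually see:

```javascript
// Sketch: retry a query a few times with exponential backoff when the
// failure looks transient (all hosts tried, socket closed, throttling).
// Anything else is rethrown immediately ("fail fast").
const TRANSIENT = /All host\(s\) tried|Socket was closed|OverloadedException/;

async function executeWithRetry(queryFn, { retries = 3, baseDelayMs = 100 } = {}) {
  let lastErr;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await queryFn();
    } catch (err) {
      lastErr = err;
      // Non-transient errors (bad query, auth failure) are not worth retrying.
      if (!TRANSIENT.test(String(err && err.message))) throw err;
      if (attempt < retries) {
        await new Promise(r => setTimeout(r, baseDelayMs * 2 ** attempt));
      }
    }
  }
  throw lastErr;
}
```

Usage in the class above would be something like executeWithRetry(() => this.tpmsClient.execute(SQL, [orgKey], { prepare: true })), keeping the driver itself on a fail-fast policy.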

Can't reach local DynamoDB from Lambda in AWS SAM

I have a simple AWS SAM setup with a Go lambda function. It uses https://github.com/aws/aws-sdk-go-v2 for dealing with DynamoDB; the endpoint is set via environment variables from the template (here: https://github.com/abtercms/abtercms2/blob/main/template.yaml).
My problem is that I can't get my Lambda to return anything from DynamoDB; as a matter of fact, I don't think it can reach it at all.
{
  "level": "error",
  "status": 500,
  "error": "failed to fetch item (status 404), err: operation error DynamoDB: GetItem, exceeded maximum number of attempts, 3, https response error StatusCode: 0, RequestID: , request send failed, Post \"http://127.0.0.1:8000/\": dial tcp 127.0.0.1:8000: connect: connection refused",
  "path": "GET /websites/abc",
  "time": 1658243299,
  "message": "..."
}
I'm able to reach the local DynamoDB just fine from the host machine.
Results were the same both for running DynamoDB in docker compose and via the .jar file, so the problem is not there as far as I can tell.
Also, feel free to check out the whole project; I'm just running make sam-local and make curl-get-website-abc above. The code is here: https://github.com/abtercms/abtercms2.

Unable to start amplify mock InternalFailure: The request processing has failed because of an unknown error, exception or failure

I have just tried to start my amplify mock service and received the following error:
InternalFailure: The request processing has failed because of an unknown error, exception or failure.
This has previously worked, a few hours ago with no resets or other changes.
To fix this, I did have success with removing amplify completely and running amplify init and amplify add api again, but this means I lose my local data each time, and the error has recurred randomly multiple times in the last few hours.
Here is the full log from when the error occurs:
hutber#hutber:/var/www/unsal.co.uk$ amplify mock
GraphQL schema compiled successfully.
Edit your schema at /var/www/unsal.co.uk/amplify/backend/api/unsalcouk/schema.graphql or place .graphql files in a directory at /var/www/unsal.co.uk/amplify/backend/api/unsalcouk/schema
Failed to start API Mock endpoint InternalFailure
The problem probably comes from the SQLite file used for the mock (a lock issue, I guess). Delete the mock-data/dynamodb/your_db_file.db file and run amplify mock api again. The file recreates itself correctly. This avoids resetting the whole amplify project.

"LAMBDA_RUNTIME" Error on high-volume Lambda Function

I'm currently using a Lambda Function written in Javascript that is setup with an SQS event source to automatically pull messages from an SQS Queue and do some basic processing on the message contents. I cannot show the code but the summary of the lambda function's execution is basically:
For each message in the batch it receives as part of the event:
It parses the body, which is a JSON string, into a Javascript object.
It reads an object from S3 that is referenced in the parsed message, using getObject.
It puts a record into a DynamoDB table using put.
If there were no errors, it deletes the individual SQS message that was processed from the Queue using deleteMessage.
This SQS queue is high-volume and receives messages in-bulk, regularly building up a backlog of millions of messages. The Lambda is normally able to scale to process hundreds of thousands of messages concurrently. This solution has worked well for me with other applications in the past but I'm now encountering the following intermittent error that reliably begins to appear as the Lambda scales up:
[ERROR] [#############] LAMBDA_RUNTIME Failed to post handler success response. Http response code: 400.
I've been unable to find any information anywhere about what this error means or what causes it. There appears to be no discernible pattern as to which executions encounter it. The function is usually able to run for a brief period without encountering the error and scale to expected levels, but then the error starts to appear quite suddenly and completely destroys the Lambda throughput by forcing it to scale back down.
Does anyone know what this "LAMBDA_RUNTIME" error means and what might cause it? My Lambda Function runtime is Node v12.
Your function is being invoked asynchronously, so when it finishes it signals the caller whether it was successful.
You should have an error some milliseconds earlier, probably an unhandled exception that is not being logged. If that's the case, your function ends without knowing about the exception and tries to post a success response.
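If that is what is happening, a cheap way to surface the real failure is to wrap the handler so the exception is logged explicitly before being rethrown. A sketch (wrapHandler is a hypothetical helper, not part of the Lambda runtime):

```javascript
// Sketch: log any unhandled exception from the inner handler before
// rethrowing it, so the root cause shows up in CloudWatch next to the
// LAMBDA_RUNTIME message instead of being swallowed.
const wrapHandler = (fn) => async (event, context) => {
  try {
    return await fn(event, context);
  } catch (err) {
    console.error('Unhandled handler error:', err && err.stack ? err.stack : err);
    throw err; // still fail the invocation so SQS can redrive the batch
  }
};

// exports.handler = wrapHandler(async (event) => { /* real processing */ });
```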
I get this error too, only mine is:
[ERROR] [1638918279694] LAMBDA_RUNTIME Failed to post handler success response. Http response code: 413.
I went to the Lambda function in the AWS console, ran the test with a custom event I built, and the error I got there was:
{
  "errorMessage": "Response payload size exceeded maximum allowed payload size (6291556 bytes).",
  "errorType": "Function.ResponseSizeTooLarge"
}
So this is the actual error, which CloudWatch doesn't show, but the test section of the Lambda function console does.
I think I'll have to return info to an S3 file or something, but that's another matter.
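The "return info to an S3 file" idea can be sketched like this: if the serialized response would exceed the payload limit, upload it to S3 and return a pointer instead. The S3 client is injected so the sketch runs standalone, and the bucket and key names are made up:

```javascript
// Sketch: keep the Lambda response under the payload limit by offloading
// large bodies to S3 and returning only a reference to them. The s3 client
// is injected (in real code, an AWS SDK S3 client with a putObject method).
const MAX_RESPONSE_BYTES = 6 * 1024 * 1024; // approximate Lambda response limit

async function respondOrOffload(s3, bucket, key, payload, maxBytes = MAX_RESPONSE_BYTES) {
  const body = JSON.stringify(payload);
  if (Buffer.byteLength(body) <= maxBytes) {
    return { inline: true, payload };          // small enough: return directly
  }
  await s3.putObject({ Bucket: bucket, Key: key, Body: body });
  return { inline: false, s3Location: `s3://${bucket}/${key}` };
}
```

The caller then fetches the object from S3 when inline is false.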

Random ssl handshake failure when pulling file from Amazon s3 bucket

I have a specific fetch request in my node app which just pulls a json file from my S3 bucket and stores the content within the state of my app.
The fetch request works 99% of the time, but for some reason about every 4 or 5 days I get a notification saying the app has crashed, and when I investigate, the reason is always this SSL handshake failure.
I am trying to figure out why this is happening as well as a fix to prevent this in future cases.
The fetch request looks like the following and is called every time someone new visits the site. Once the request has been made and the JSON is in the app's state, the request is no longer called.
function grabPreParsedContentFromS3 (prefix, callback) {
  fetch(`https://s3-ap-southeast-2.amazonaws.com/my-bucket/${prefix}.json`)
    .then(res => res.json())
    .then(res => callback(res[prefix]))
    .catch(e => console.log('Error fetching data from s3: ', e))
}
When this error happens, the .catch handler fires and logs the following error message:
Error fetching data from s3: {
FetchError: request to https://s3-ap-southeast-2.amazonaws.com/my-bucket/services.json failed,
reason: write EPROTO 139797093521280:error:1409E0E5:SSL routines:ssl3_write_bytes:ssl handshake failure:../deps/openssl/openssl/ssl/s3_pkt.c:659:
...
}
Has anyone encountered this kind of issue before, or has any idea why it might be happening? Currently I am wondering whether there is a limit to the number of S3 requests I can make at one time, which is causing it to fail. But the site isn't popular enough to be making a huge number of fetches either.
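One pragmatic mitigation is to retry the fetch once or twice when the error looks like a transient TLS or socket failure, rather than letting the crash propagate. A sketch with the fetch call injected so it runs standalone; the error-matching pattern is an assumption based on the EPROTO message above:

```javascript
// Sketch: retry a fetch on errors that look transient (EPROTO, ECONNRESET,
// handshake failure) with a short delay; rethrow anything else, or give up
// after maxRetries attempts.
async function fetchWithRetry(doFetch, { maxRetries = 2, delayMs = 250 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await doFetch();
    } catch (err) {
      const transient = /EPROTO|ECONNRESET|handshake failure/.test(String(err && err.message));
      if (!transient || attempt >= maxRetries) throw err;
      await new Promise(r => setTimeout(r, delayMs));
    }
  }
}
```

In grabPreParsedContentFromS3 the call would become fetchWithRetry(() => fetch(url)); it is also worth making sure a rejection there cannot escape as an unhandled rejection, since that alone can bring a Node app down.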