Retrieving the progress of getObject (aws-sdk) - amazon-web-services

I'm using node.js with the aws-sdk (for S3).... When I am downloading a huge file from s3, how can I regularly retrieve the progress of the download so that the front-end can show a progress bar? Currently I am using getObject. (https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#getObject-property)
The code to download the file works. Here's a snippet of my code...
return await new Promise((resolve, reject) => {
  this.s3.getObject(params, (error, data) => {
    if (error) {
      reject(error);
    } else {
      resolve(data.Body);
    }
  });
});
I'm just not sure how to hook into the progress as it's downloading. Thanks in advance for any insight!

You can utilize S3 byte-range fetching which allows the fetching of small parts of a file in S3. This capability then allows us to fetch large objects by dividing the file download into multiple parts which brings the following advantages:
Part download failure does not require full re-downloading of the file.
Download pause/resume capability.
Download progress tracking
Retrying only the parts that failed or were interrupted by network issues.
Sniff headers located in the first few bytes of the file if we just need to get metadata from the files.
You can split the file download into parts of whatever size you choose (I suggest 1-4 MB at a time) and download the parts chunk by chunk. As each of the getObject promises completes, you can track how many parts have finished. A good place to start is the AWS documentation.
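As a rough sketch of the splitting step described above: the helper names `buildByteRanges` and `percentComplete` are made up for illustration, and the part size is whatever you choose. It only computes the range strings and the percentage; the actual requests are left to your own code.

```javascript
// Split an object of `totalBytes` into inclusive HTTP Range header values.
// `partSize` is an assumed tuning knob (the answer suggests 1-4 MB).
function buildByteRanges(totalBytes, partSize) {
  if (!Number.isInteger(totalBytes) || totalBytes < 0) {
    throw new RangeError('totalBytes must be a non-negative integer');
  }
  if (!Number.isInteger(partSize) || partSize <= 0) {
    throw new RangeError('partSize must be a positive integer');
  }
  const ranges = [];
  for (let start = 0; start < totalBytes; start += partSize) {
    // Range ends are inclusive, so the last byte of a part is start + size - 1.
    const end = Math.min(start + partSize, totalBytes) - 1;
    ranges.push(`bytes=${start}-${end}`);
  }
  return ranges;
}

// Share of parts finished so far, as a percentage for a progress bar.
function percentComplete(completedParts, totalParts) {
  return totalParts === 0 ? 100 : (completedParts / totalParts) * 100;
}
```

Each string can be passed as the Range parameter of a getObject call, with the total size taken from the ContentLength that headObject returns. Counting resolved promises and feeding that count to `percentComplete` gives the number to send to the front-end.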

STREAMING OPTION
Another option is to use a stream and track the amount of bytes received:
const { ContentLength: contentLength } = await s3.headObject(params).promise();
const rs = s3.getObject(params).createReadStream();
let progress = 0;
rs.on('data', function (chunk) {
  // Advance your progress by chunk.length
  progress += chunk.length;
  // Multiply by 100 so the ratio is reported as a percentage
  console.log(`Progress: ${((progress / contentLength) * 100).toFixed(1)}%`);
});
// ... pipe to write stream
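If the front-end only needs whole-number updates, the 'data' handler can be throttled so it reports only when the percentage actually changes. This is a minimal sketch under that assumption; `createProgressTracker` is a hypothetical helper, not part of the aws-sdk.

```javascript
// Returns a function to call with each chunk. It yields a whole-number
// percentage only when that number changes, and null otherwise.
function createProgressTracker(totalBytes) {
  let received = 0;
  let lastPercent = -1;
  return function onChunk(chunk) {
    received += chunk.length;
    if (!totalBytes) return null; // unknown or zero size: nothing meaningful to report
    const percent = Math.min(100, Math.floor((received / totalBytes) * 100));
    if (percent === lastPercent) return null;
    lastPercent = percent;
    return percent;
  };
}
```

Inside the stream handler you would call the returned function with each chunk and forward the result to the client only when it is not null.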

Invoking binary in aws lambda with rust

So I have the following rust aws lambda function:
use std::io::Read;
use std::process::{Command, Stdio};
use lambda_http::{run, service_fn, Body, Error, Request, RequestExt, Response};
use lambda_http::aws_lambda_events::serde_json::json;
/// This is the main body for the function.
/// Write your code inside it.
/// There are some code examples in the following URLs:
/// - https://github.com/awslabs/aws-lambda-rust-runtime/tree/main/examples
async fn function_handler(_event: Request) -> Result<Response<Body>, Error> {
// Extract some useful information from the request
let program = Command::new("./myProgram")
.stdout(Stdio::piped())
.output()
.expect("failed to execute process");
let data = String::from_utf8(program.stdout).unwrap();
let parsed = data.split("\n").filter(|x| !x.is_empty()).collect::<Vec<&str>>();
// Return something that implements IntoResponse.
// It will be serialized to the right response event automatically by the runtime
let resp = Response::builder()
.status(200)
.header("content-type", "application/json")
.body(json!(parsed).to_string().into())
.map_err(Box::new)?;
Ok(resp)
}
#[tokio::main]
async fn main() -> Result<(), Error> {
tracing_subscriber::fmt()
.with_max_level(tracing::Level::INFO)
// disable printing the name of the module in every log line.
.with_target(false)
// disabling time is handy because CloudWatch will add the ingestion time.
.without_time()
.init();
run(service_fn(function_handler)).await
}
The idea here is that I want to return the response from the binary in JSON format.
I'm compiling the function with cargo lambda, which produces a bootstrap file. I then zip it manually, including both the bootstrap binary and the myProgram binary.
When I test my function in the Lambda panel by sending an event to it, I get a response with the right headers etc., but the response body is empty.
I'm deploying my function through the AWS panel, on the Custom runtime on Amazon Linux 2, by uploading the zip file.
When I test locally with cargo lambda watch and cargo lambda invoke, the response body is filled with the myProgram stdout parsed to JSON.
Any ideas or thoughts on what goes wrong in the actual cloud are much appreciated!
My problem was with the dynamically linked libraries in the binary. It is actually a Python binary, and it was missing a specific version of GLIBC.
The easiest solution in my case was to compile myProgram on Amazon Linux 2.

Lambda SQL Server RDS Connection Leak

Problem
I'm using mssql v6.2.0 in a Lambda that is invoked frequently (consistently ~25 concurrent invocations under standard load).
I seem to be having trouble with connection pooling or something because I keep having tons of open DB connections which overwhelm my database (SQL Server on RDS) causing the Lambdas to just time out waiting for query results.
I have read the docs, various similar questions, Github issues, etc. but nothing has worked for this particular issue.
Things I've Learned Already
I did learn that pooling is possible across invocations due to the fact that variables outside the handler function are shared across invocations in the same container. This makes me think I should see just a few connections for each container running my Lambda, but I don't know how many that is so it's hard to verify. Bottom line is that pooling should keep me from having tons and tons of open connections, so something isn't working right.
There are several different ways to use mssql and I have tried several of them. Notably I've tried specifying max pool size with both large and small values but got the same results.
AWS recommends that you check to see if there's already a pool before trying to create a new one. I tried that to no avail. It was something like pool = pool || await createPool()
I know that RDS Proxy exists to help with situations like this, but it appears it isn't offered (at this time) for SQL Server instances.
I do have the ability to slow down my data a bit, but this has a slight impact on the performance of the product as a whole, so I don't want to do that just to avoid solving a DB connections issue.
Left unchecked, I saw as many as 700 connections to the DB at once, leading me to think there's a leak of some kind and it's maybe not just a reasonable result of high usage.
I didn't find a way to shorten the TTL for connections on the SQL Server side as recommended by this re:Invent slide. Perhaps that is part of the answer?
Code
'use strict';
/* Dependencies */
const sql = require('mssql');
const fs = require('fs').promises;
const path = require('path');
const AWS = require('aws-sdk');
const GeoJSON = require('geojson');
AWS.config.update({ region: 'us-east-1' });
var iotdata = new AWS.IotData({ endpoint: process.env['IotEndpoint'] });
/* Export */
exports.handler = async function (event) {
let myVal= event.Records[0].Sns.Message;
// Gather prerequisites in parallel
let [
query1,
query2,
pool
] = await Promise.all([
fs.readFile(path.join(__dirname, 'query1.sql'), 'utf8'),
fs.readFile(path.join(__dirname, 'query2.sql'), 'utf8'),
sql.connect(process.env['connectionString'])
]);
// Query DB for updated data
let results = await pool.request()
.input('MyCol', sql.TYPES.VarChar, myVal)
.query(query1);
// Prepare IoT Core message
let params = {
topic: `${process.env['MyTopic']}/${results.recordset[0].TopicName}`,
payload: convertToGeoJsonString(results.recordset),
qos: 0
};
// Publish results to MQTT topic
try {
await iotdata.publish(params).promise();
console.log(`Successfully published update for ${myVal}`);
//Query 2
await pool.request()
.input('MyCol1', sql.TYPES.Float, results.recordset[0]['Foo'])
.input('MyCol2', sql.TYPES.Float, results.recordset[0]['Bar'])
.input('MyCol3', sql.TYPES.VarChar, results.recordset[0]['Baz'])
.query(query2);
} catch (err) {
console.log(err);
}
};
/**
* Convert query results to GeoJSON for API response
* @param {Array|Object} data - The query results
*/
function convertToGeoJsonString(data) {
let result = GeoJSON.parse(data, { Point: ['Latitude', 'Longitude']});
return JSON.stringify(result);
}
Question
Please help me understand why I'm getting runaway connections and how to fix it. For bonus points: what's the ideal strategy for handling high DB concurrency on Lambda?
Ultimately this service needs to handle several times the current load -- I realize this becomes a quite intense load. I'm open to options like read replicas or other read-performance-boosting measures as long as they're compatible with SQL Server, and they're not just a cop out for writing proper DB access code.
Please let me know if I can improve the question. I know there are similar ones out there but I have read/tried a lot of them and didn't find them to help. Thanks in advance!
Related Material
https://forums.aws.amazon.com/thread.jspa?messageID=678029 (old, but similar)
https://www.slideshare.net/AmazonWebServices/best-practices-for-using-aws-lambda-with-rdsrdbms-solutions-srv320 re:Invent slide deck
https://www.jeremydaly.com/reuse-database-connections-aws-lambda/ Relevant info but for MySQL instead of SQL Server
Answer
I finally found the answer after 4 days of effort. All I needed to do was scale up the DB. The code is actually fine as-is.
I went from db.t2.micro to db.t3.small (or 1 vCPU, 1GB RAM to 2 vCPU and 2GB RAM) at a net cost of roughly $15/mo.
Theory
In my case, the DB probably couldn't handle the processing (which involves several geographic calculations) for all my invocations at once. I did see CPU go up, but I assumed that was a result of the high number of open connections. When the queries slowed down, the concurrent invocations piled up as Lambdas started to wait for results, finally causing them to time out and not close their connections properly.
Comparisons:
db.t2.micro:
200+ DB connections (goes up continuously if you leave it running)
50+ concurrent invocations
5000+ ms Lambda duration when things slow down, ~300ms under no load
db.t3.small:
25-35 DB connections (constantly)
~5 concurrent invocations
~33 ms Lambda duration <-- ten times faster!
CloudWatch Dashboard
Summary
I think this issue was confusing to me because it didn't smell like a capacity issue. Almost every time I've dealt with high DB connections in the past, it has been a code error. Having tried options there, I thought it was "some magical gotcha of serverless" that I needed to understand. In the end it was as simple as changing DB tiers. My takeaway is that DB capacity issues can manifest themselves in ways other than high CPU and memory usage, and that high connections may be a result of something besides a code bug.
Update (4 months in)
This continues to work very well. I'm impressed that doubling the DB resources seems to have given > 2x performance. Now, when due to load (or a temporary bug during development), the db connections get really high (even over 1k) the DB handles it. I'm not seeing any issues at all with db connections timing out or the database getting bogged down due to load. Since the original time of writing I've added several CPU-intensive queries to support reporting workloads, and it continues to handle all these loads simultaneously.
We've also deployed this setup to production for one customer since the time of writing and it handles that workload without issue.
So a connection pool is no good on Lambda at all; what you can do is reuse connections.
The trouble is that if every Lambda execution opens a pool, it'll just flood the DB like you're seeing. You want 1 connection per Lambda container. You can use a DB class like so (this is rough, but let me know if you've got questions):
export default class MySQL {
constructor() {
this.connection = null
}
async getConnection() {
if (this.connection === null || this.connection.state === 'disconnected') {
return this.createConnection()
}
return this.connection
}
async createConnection() {
this.connection = await mysql.createConnection({
host: process.env.dbHost,
user: process.env.dbUser,
password: process.env.dbPassword,
database: process.env.database,
})
return this.connection
}
async query(sql, params) {
await this.getConnection()
let err
let rows
[err, rows] = await to(this.connection.query(sql, params))
if (err) {
console.log(err)
return false
}
return rows
}
}
function to(promise) {
return promise.then((data) => {
return [null, data]
}).catch(err => [err])
}
What you need to understand is that a Lambda execution is a little virtual machine that does a task and then stops. It does sit there for a while, and if anyone else needs it, it gets reused along with the container and its connection. Because a container handles a single task at a time, there are never multiple connections from a single Lambda.
Hope this helps let me know if ya need any more detail! Oh and welcome to stackoverflow, that's a well-constructed question.
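The "one connection per container" idea can also be sketched without tying it to a particular driver. In this illustration `createReusableConnection` and the `connect` factory are hypothetical names; the only assumption is that `connect` returns a promise for a connection (or pool).

```javascript
// `connect` is a hypothetical factory supplied by the caller; it must return
// a promise that resolves to a connection or pool.
function createReusableConnection(connect) {
  let cached = null; // module-scope state survives while the container is warm
  return function getConnection() {
    if (cached === null) {
      // Cache the promise itself so overlapping callers share one attempt.
      cached = new Promise((resolve) => resolve(connect())).catch((err) => {
        cached = null; // forget the failure so the next call can retry
        throw err;
      });
    }
    return cached;
  };
}
```

With the mssql code from the question, this might be declared once outside the handler, for example `const getPool = createReusableConnection(() => sql.connect(process.env['connectionString']));`, and then awaited inside the handler. Whether that alone fixes a given connection spike depends on the workload, as the accepted answer shows.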

RaiBlocks blockchain (Nano) local client (rai_node --daemon) cannot open new account

I run a raiblocks blockchain node on Linux:
./rai_node --daemon
I created an account and from NANEX made a transfer to that account of some NANO coins.
I checked in the explorer
https://www.nanode.co
and it shows the payment and the HASH, and shows it as not yet pocketed.
To complete the opening of the new account, I followed the steps of the manual below, but it fails:
https://www.nanode.co/docs#rpc-guide-connecting
In step 1 of the above manual:
The RPC command “accounts_pending” returns an empty string and not the HASH that appears in the Explorer above.
For the RPC command I used both “curl” and the nodejs library “axios”, which the online manual uses.
With “curl”:
curl -g -d \
'{"action": "accounts_pending","accounts": ["xrb_acountnumber"],"count": 1}' \
'[::ffff:127.0.0.1]:7076'
With nodejs:
// Step 1. Retrieve hash of the send block that you sent from your wallet
const pending = await rpc.post('/', {
  action: 'accounts_pending',
  accounts: ["xrb_acountnumber"]
})
The above are not working even after I manually downloaded the Nano_blockchain into my local client and it is almost synchronized.
Also in the step no 4:
// Step 4. Publish your open block to the network using "process"
const processResult = await rpc.post('/', {
  action: 'process',
  block: newBlock.data.block
})
console.log(processResult.data.hash) // The hash of your newly published open block
it returns the error:
data: { error: 'Gap source block' } }
instead of the hash of your newly published open block
Any help would be very much appreciated.

Thumbnail the first page of a pdf from a stream in GraphicsMagick

I know how to use GraphicsMagick to make a thumbnail of the first page of a pdf if I have a pdf file and am running gm locally. I can just do this:
gm(pdfFileName + "[0]")
.background("white")
.flatten()
.resize(200, 200)
.write("output.jpg", (err, res) => {
if (err) console.log(err);
});
If I have a file called doc.pdf then passing doc.pdf[0] to gm works beautifully.
But my problem is I am generating thumbnails on an AWS Lambda function, and the Lambda takes as input data streamed from a source S3 bucket. The relevant slice of my lambda looks like this:
// Download the image from S3, transform, and upload to a different S3 bucket.
async.waterfall([
function download(next) {
s3.getObject({
Bucket: sourceBucket,
Key: sourceKey
},
next);
},
function transform(response, next) {
gm(response.Body).size(function(err, size) { // <--- gm USED HERE
.
.
.
Everything works, but for multipage pdfs, gm is generating a thumbnail from the last page of the pdf. How do I get the [0] in there? I did not see a page selector in the gm documentation, as all their examples used filenames, not streams. I believe there should be an API, but I have not found one.
(Note: the [0] is really important, not only because the last page of a multipage PDF is sometimes blank, but also because I noticed that when running gm on the command line with large pdfs, the [0] returns very quickly, while without the [0] the whole pdf is scanned. On AWS Lambda, it's important to finish quickly to save on resources and avoid timeouts!)
You can use .selectFrame() method, which is equivalent to specifying [0] directly in file name.
In your code:
function transform(response, next) {
gm(response.Body)
.selectFrame(0) // <--- select the first page
.size(function(err, size) {
.
.
.
Don't get confused by the name of the function. It works not only with frames for GIFs, but also just fine with pages for PDFs.
Check out this function's source on GitHub.
Credits to @BenFortune for his answer to a similar question about a GIF's first frame. I took it as inspiration and tested this solution with PDFs; it actually works.
Hope it helps.

Postman - how to loop request until I get a specific response?

I'm testing API with Postman and I have a problem:
My request goes to sort of middleware, so either I receive a full 1000+ line JSON, or I receive PENDING status and empty array of results:
{
"meta": {
"status": "PENDING",
"missing_connectors_count": 0,
"xxx_type": "INTERNATIONAL"
},
"results": []
}
The question is: how do I loop this request in Postman until I get status SUCCESS and a results array with more than 0 items?
When I'm sending those requests manually one-by-one it's ok, but when I'm running them through Collection Runner, "PENDING" messes up everything.
I found an awesome post about retrying a failed request by Christian Baumann which allowed me to find a suitable approach to the exact same problem of first polling the status of some operation and only when it's complete run the actual tests.
The code I'd end up with if I were you is:
const maxNumberOfTries = 3; // your max number of tries
const sleepBetweenTries = 5000; // your interval between attempts
if (!pm.environment.get("tries")) {
pm.environment.set("tries", 1);
}
const jsonData = pm.response.json();
if ((jsonData.meta.status !== "SUCCESS" && jsonData.results.length === 0) && (pm.environment.get("tries") < maxNumberOfTries)) {
const tries = parseInt(pm.environment.get("tries"), 10);
pm.environment.set("tries", tries + 1);
setTimeout(function() {}, sleepBetweenTries);
postman.setNextRequest(request.name);
} else {
pm.environment.unset("tries");
// your actual tests go here...
}
What I liked about this approach is that the call postman.setNextRequest(request.name) doesn't have any hardcoded request names. The downside I see with this approach is that if you run such request as a part of the collection, it will be repeated a number of times, which might bloat your logs with unnecessary noise.
The alternative I was considering is writing a Pre-request Script which will do polling (by sending a request) and spinning until the status is some kind of completion. The downside of this approach is the need for much more code for the same logic.
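The retry decision in the script above can be pulled out into a small pure function, which makes it easy to check outside Postman. This is only a sketch: `shouldRetry` is a hypothetical name, and it deliberately treats the job as finished only when the status is SUCCESS and the results array is non-empty (what the question asks for), which is slightly stricter than the condition in the script.

```javascript
// Decide whether a polling request should be repeated.
// `body` mirrors the response shape shown in the question; `tries` counts
// the attempts made so far, including the current one.
function shouldRetry(body, tries, maxTries) {
  const status = body && body.meta ? body.meta.status : undefined;
  const hasResults = Array.isArray(body && body.results) && body.results.length > 0;
  const done = status === 'SUCCESS' && hasResults;
  return !done && tries < maxTries;
}
```

In a Tests script, the return value would decide between calling postman.setNextRequest with the current request name and running the real assertions.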
When waiting for services to be ready, or when polling for long-running job results, I see 4 basic options:
1. Use Postman collection runner or newman and set a per-step delay. This delay is inserted between every step in the collection. Two challenges here: it can be fragile unless you set the delay to a value the request duration will never exceed, AND, frequently, only a small number of steps need that delay and you are increasing total test run time, creating excessive build times for a common build server delaying other pending builds.
2. Use https://postman-echo.com/delay/10 where the last URI element is number of seconds to wait. This is simple and concise and can be inserted as a single step after the long running request. The challenge is if the request duration varies widely, you may get false failures because you didn't wait long enough.
3. Retry the same step until success with postman.setNextRequest(request.name);. The challenge here is that Postman will execute the request as fast as it can which can DDoS your service, get you black-listed (and cause false failures), and chew up a lot of CPU if run on a common build server - slowing other builds.
4. Use setTimeout() in a Pre-request Script. The only downside I see in this approach is that if you have several steps needing this logic, you end up with some cut & paste code that you need to keep in sync.
Note: there are minor variations on these - like setting them on a collection, a collection folder, a step, etc.
I like option 4 because it provides the right level of granularity for most of my cases. Note that this appears to be the only way to "sleep" in a Postman script. Standard JavaScript sleep methods like a Promise with async and await are not supported, and using the sandbox's lodash _.delay(function() {}, delay, args[...]) does not keep script execution on the Pre-request script.
In Postman standalone app v6.0.10, set your step Pre-request script to:
console.log('Waiting for job completion in step "' + request.name + '"');
// Construct our request URL from environment variables
var url = request['url'].replace('{{host}}', postman.getEnvironmentVariable('host'));
var retryDelay = 1000;
var retryLimit = 3;
function isProcessingComplete(retryCount) {
pm.sendRequest(url, function (err, response) {
if(err) {
// hmmm. Should I keep trying or fail this run? Just log it for now.
console.log(err);
} else {
// I could also check for response.json().results.length > 0, but that
// would omit SUCCESS with empty results which may be valid
if(response.json().meta.status !== 'SUCCESS') {
if (retryCount < retryLimit) {
console.log('Job is still PENDING. Retrying in ' + retryDelay + 'ms');
setTimeout(function() {
isProcessingComplete(++retryCount);
}, retryDelay);
} else {
console.log('Retry limit reached, giving up.');
postman.setNextRequest(null);
}
}
}
});
}
isProcessingComplete(1);
And you can do your standard tests in the same step.
Note: Standard caveats apply to making retryLimit large.
Try this:
var body = JSON.parse(responseBody);
if (body.meta.status !== "SUCCESS" && body.results.length === 0){
postman.setNextRequest("This_same_request_title");
} else {
postman.setNextRequest("Next_request_title");
/* you can also try postman.setNextRequest(null); */
}
I was searching for an answer to the same question and thought of a possible solution as I was reading your question.
Use postman workflow to rerun your request every time you don't get the response you're looking for. Anyway, that's what I'm gonna try.
postman.setNextRequest("request_name");
https://www.getpostman.com/docs/workflows
I couldn't find complete guidelines for this issue, so I decided to invest some time and describe all the steps of the process from A to Z.
I will walk through an example where we need to go through a list of transaction ids and, in each iteration, change the query param to the next transaction id from the list.
Step 1. Prepare your request
https://some url/{{transactionId}}
Add the {{transactionId}} variable so that it can be changed from the pre-request script.
If you need a token for request you should add it here, in Authorization tab.
Save request to collection (Save button in the right corner). For demonstration purpose I will use "Transactions Request" name. We will need to use this name later on.
Step 2. Prepare pre-request script
In postman use tab Pre-request Script to change transactionId variable from query param to actual transaction id.
let ids = pm.collectionVariables.get("TransactionIds");
ids = JSON.parse(ids);
const id = ids.shift();
console.log('id', id)
postman.setEnvironmentVariable("transactionId", id);
pm.collectionVariables.set("TransactionIds", JSON.stringify(ids));
pm.collectionVariables.get - gets array of transaction ids from collection variables. We will set it up in Step 4.
ids.shift() - we use it to remove id that we will use from our ids list (to prevent running twice on the same id)
postman.setEnvironmentVariable("transactionId", id) - change transaction id from query param to actual transaction id
pm.collectionVariables.set("TransactionIds", JSON.stringify(ids)) - we store the updated list back in the collection variable, so it no longer includes the id that was just handled.
Step 3. Prepare Tests
In postman use tab Tests to create a loop logic. Tests will be executed after the request execution, so we can use it to make next request.
let ids = pm.collectionVariables.get("TransactionIds");
ids = JSON.parse(ids);
if (ids && ids.length > 0){
console.log('length', ids.length);
postman.setNextRequest("Transactions Request");
} else {
postman.setNextRequest(null);
}
postman.setNextRequest("Transactions Request") - calls a new request, in this case it will call the "Transactions Request" request
Step 4. Run Collections
In Postman from the left side bar you should choose Collections (click on it) and then choose a tab Variables.
This is the collection variables. In our example we used TransactionIds as a variable, so put in Current Value the array of transaction ids on which you want to loop.
Now you can click on Run (the button from right corner, near Save button) to run our loop requests.
You will be prompted to choose which requests to run. Choose the request that we’ve created, "Transactions Request".
It will run our request with pre-request script and with logic that we’ve set in Tests. In the end postman will open a new window with summary of our run.
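To see how the two scripts cooperate, the queue handling can be modelled as one plain function that runs outside Postman. This is an illustrative sketch only: `takeNextId` is a hypothetical helper that combines the shift from the pre-request script with the "is anything left?" check from the Tests tab.

```javascript
// Take the next id off a JSON-encoded list, as the pre-request script does,
// and report which request should run afterwards, as the Tests script does.
function takeNextId(idsJson, requestName) {
  const ids = JSON.parse(idsJson);
  const id = ids.shift(); // undefined when the list is already empty
  return {
    id,
    remainingJson: JSON.stringify(ids),
    nextRequest: ids.length > 0 ? requestName : null, // null stops the run
  };
}
```

Feeding `remainingJson` back in on each iteration reproduces the loop: the request name keeps being returned until the list is empty, at which point null ends the run.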