Grabbing items from an S3 bucket - amazon-web-services

I want to get a list of all the objects in my S3 bucket. I've put it in an Express route application I quickly set up (it doesn't really matter that it's Express, it's just what I'm comfortable with).
So I'm doing:
var fs = require('fs');

var allObjs = [];
s3.listObjects({Bucket: 'myBucket'}, function(err, data) {
  if (err) return console.error(err);
  allObjs = data.Contents; // the returned objects
  var stringifiedObjs = JSON.stringify(allObjs);
  fs.writeFile("test", stringifiedObjs, function(err) {});
});
This grabs my objects, stringifies them and writes them to a file called test. The issue I'm having is that it only returns 1,000 results.
I read somewhere (I can't find where) that AWS limits you to 1,000 results per call.
How can I rerun this to grab the next 1,000, while making sure it's actually the next 1,000 and not the first 1,000 again?
In short, how can I get every object in my S3 bucket? I've been getting lost in the documentation.
Thank you!
EDIT
This is my object I get back :
{ Key: 'bucket_path/e11_19_9a31mv3ot51tm384grjd6rdj51boxx_q_q112.png',
  LastModified: Sat Apr 23 2016 09:16:23 GMT+0100 (BST),
  ETag: '"7d50fsdfsd4sda159b32cf85c683c5924"',
  Size: 704222,
  StorageClass: 'STANDARD',
  Owner:
   { DisplayName: 'servers',
     ID: '58af203151c51eddf2sdfs411e0b91d274a8fda23f58280f9b06371e436f7' } }

You need to set the Marker property to the key of the last element of the previous response.
Check the documentation for reference (as you already did :) ):
http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html

When you receive your response from the listObjects call, it should include two very important fields in the data property:
IsTruncated - true if there are more keys to return, false otherwise.
NextMarker - the value to use for the Marker property in the next call to listObjects. (Note that NextMarker is only returned when you specify a Delimiter; otherwise use the Key of the last element as the next Marker.)
So after you call listObjects, check the IsTruncated field. If it is true, feed the value from NextMarker (or the last Key) into Marker and call listObjects again.
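As a rough sketch of that loop (assuming an already configured s3 client and using 'myBucket' as a placeholder bucket name):
var AWS = require('aws-sdk');
var s3 = new AWS.S3();

var allObjs = [];

function listAllObjects(marker, done) {
  s3.listObjects({Bucket: 'myBucket', Marker: marker}, function(err, data) {
    if (err) return done(err);
    allObjs = allObjs.concat(data.Contents);
    if (data.IsTruncated) {
      // Use the last key as the next Marker (NextMarker is only set when a Delimiter is used).
      var lastKey = data.Contents[data.Contents.length - 1].Key;
      listAllObjects(data.NextMarker || lastKey, done);
    } else {
      done(null, allObjs);
    }
  });
}

listAllObjects(undefined, function(err, objs) {
  if (err) return console.error(err);
  console.log('Total objects:', objs.length);
});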
Update:
It would appear that the AWS.Request object has an .eachPage method which can be used to make the multiple calls automatically, so there is a built-in function that does this work for you.
var pages = 1;
s3.listObjects({Bucket: 'myBucket'}).eachPage(function(err, data) {
  if (err) return console.error(err);
  if (data === null) return; // no more pages
  console.log("Page", pages++);
  console.log(data);
});
Source: http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/Request.html

Related

AWS Kendra PreHook Lambdas for Data Enrichment

I am working on a POC using Kendra and Salesforce. The connector allows me to connect to my Salesforce Org and index knowledge articles. I have been able to set this up and it is currently working as expected.
There are a few custom fields and data points I want to bring over to help enrich the data even more. One of these is an additional answer / body that will contain key information for the searching.
This field in my data source is rich text containing HTML and is often larger than 2048 characters, a limit that seems to be imposed on String data fields within Kendra.
I came across two hooks that are built in for Pre and Post data enrichment. My thought here is that I can use the pre hook to strip HTML tags and truncate the field before it gets stored in the index.
Hook Reference: https://docs.aws.amazon.com/kendra/latest/dg/API_CustomDocumentEnrichmentConfiguration.html
Current Setup:
I have added a new field to the index called sf_answer_preview. I then mapped this field in the data source to the rich text field in the Salesforce org.
If I run this as is, it will index about 200 of the 1,000 articles and give an error that the remaining articles exceed the 2048 character limit on that field, which is why I am trying to set up the enrichment.
I set up the above enrichment on my data source. I specified a lambda to use in the pre-extraction and no additional filtering, so it runs on every article. I am not 100% certain what the S3 bucket is for since I am using a data source, but it appears to be needed, so I have added that as well.
For my lambda, I created the following:
exports.handler = async (event) => {
    // Debug
    console.log(JSON.stringify(event))
    // Vars
    const s3Bucket = event.s3Bucket;
    const s3ObjectKey = event.s3ObjectKey;
    const meta = event.metadata;
    // Answer
    const answer = meta.attributes.find(o => o.name === 'sf_answer_preview');
    // Remove HTML Tags
    const removeTags = (str) => {
        if ((str === null) || (str === ''))
            return '';
        else
            str = str.toString();
        return str.replace(/(<([^>]+)>)/ig, '');
    }
    // Truncate
    const truncate = (input) => input.length > 2000 ? `${input.substring(0, 2000)}...` : input;
    let result = truncate(removeTags(answer.value.stringValue));
    // Response
    const response = {
        "version": "v0",
        "s3ObjectKey": s3ObjectKey,
        "metadataUpdates": [
            {"name": "sf_answer_preview", "value": {"stringValue": result}}
        ]
    }
    // Debug
    console.log(response)
    // Response
    return response
};
Based on the contract for the lambda described here, it appears pretty straightforward. I access the event, find the field in the data called sf_answer_preview (the rich text field from Salesforce), and strip and truncate the value to 2,000 characters.
For the response, I am telling it to update that field to the new formatted answer so that it complies with the field limits.
When I log the data in the lambda, the pre-extraction event details are as follows:
{
    "s3Bucket": "kendrasfdev",
    "s3ObjectKey": "pre-extraction/********/22736e62-c65e-4334-af60-8c925ef62034/https://*********.my.salesforce.com/ka1d0000000wkgVAAQ",
    "metadata": {
        "attributes": [
            {
                "name": "_document_title",
                "value": {
                    "stringValue": "What majors are under the Exploratory track of Health and Life Sciences?"
                }
            },
            {
                "name": "sf_answer_preview",
                "value": {
                    "stringValue": "A complete list of majors affiliated with the Exploratory Health and Life Sciences track is available online. This track allows you to explore a variety of majors related to the health and life science professions. For more information, please visit the Exploratory program description. "
                }
            },
            {
                "name": "_data_source_sync_job_execution_id",
                "value": {
                    "stringValue": "0fbfb959-7206-4151-a2b7-fce761a46241"
                }
            }
        ]
    }
}
The Problem:
When this runs, I am still getting the same field limit error saying the content exceeds the character limit. When I run the lambda on the raw data, it strips and truncates it as expected. I am thinking that the response from the lambda for some reason isn't setting the field value to the new content correctly, and Kendra is still trying to use the data directly from Salesforce, thus throwing the error.
Has anyone set up lambdas for Kendra before who might know what I am doing wrong? Being able to do things like strip PII before it gets indexed seems like a common need, so I must be slightly off in my setup somewhere.
Any thoughts?
Since you are still passing the rich text as a metadata field of the document, the character limit still applies, so the document fails at the validation step of the API call and never reaches the enrichment step. A workaround is to somehow append those rich text fields to the body of the document so that your lambda can access them there. But if those fields are auto-generated for your documents from your data sources, that might not be easy.
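As a rough, untested sketch of that workaround: it assumes the pre-extraction hook may read the staged document from the s3Bucket / s3ObjectKey it receives and write a modified copy back to the same bucket, and the "modified/" prefix is purely illustrative. The lambda would append the stripped rich text to the body instead of keeping it as a metadata field:
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

exports.handler = async (event) => {
    const { s3Bucket, s3ObjectKey, metadata } = event;

    // Read the staged document body from S3.
    const obj = await s3.getObject({ Bucket: s3Bucket, Key: s3ObjectKey }).promise();
    let body = obj.Body.toString('utf-8');

    // Pull the oversized rich-text attribute, strip tags, and append it to the body
    // instead of returning it as a metadata field.
    const answer = metadata.attributes.find(a => a.name === 'sf_answer_preview');
    if (answer) {
        const stripped = answer.value.stringValue.toString().replace(/(<([^>]+)>)/ig, '');
        body += '\n' + stripped;
    }

    // Write the modified document back and point Kendra at the new key.
    // "modified/" is a hypothetical prefix, not part of the Kendra contract.
    const modifiedKey = 'modified/' + s3ObjectKey;
    await s3.putObject({ Bucket: s3Bucket, Key: modifiedKey, Body: body }).promise();

    return {
        version: 'v0',
        s3ObjectKey: modifiedKey,
        metadataUpdates: []
    };
};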

Is it possible to invoke lambda only on item expiry from Dynamo

I have set up a DynamoDB table, enabled a stream on it, and enabled TTL (time to live) on one of the attributes. I also have a lambda which pulls entries from the DynamoDB stream.
Right now, whether I add, delete, or edit an item, or the TTL expires, the lambda is invoked.
I am not interested in the add or edit events; I only want the stream to deliver the deleted / TTL-expired entries. Is this possible?
I can definitely put a check in my lambda code and only process events of type "delete", but the lambda invocations for add and edit will still take place regardless. Kindly guide.
You can't control the DynamoDB stream itself; it will always post events for any change that happens to your table. However, you can control the lambda invocation by adding the "FilterCriteria" property to your "EventSourceMapping":
https://docs.aws.amazon.com/lambda/latest/dg/invocation-eventfiltering.html
FilterCriteria: {"Filters": [{"Pattern": "{ \"userIdentity\": { \"type\": [ \"Service\" ] ,\"principalId\": [\"dynamodb.amazonaws.com\"] }}"}]}
Using the above filter, your lambda will only be invoked when a TTL expiry event is posted to the DynamoDB stream.
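For illustration only (this is not from the original answer), attaching such a filter when creating the event source mapping could look roughly like this with the Node.js SDK, assuming an SDK version recent enough to support FilterCriteria; the function name, stream ARN and region are placeholders:
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda({ region: 'us-east-1' }); // placeholder region

const params = {
  FunctionName: 'my-ttl-consumer', // placeholder function name
  EventSourceArn: 'arn:aws:dynamodb:us-east-1:123456789012:table/my-table/stream/2021-09-27T17:03:42.812', // placeholder ARN
  StartingPosition: 'LATEST',
  FilterCriteria: {
    Filters: [
      {
        // Only TTL-expired deletions carry this service identity.
        Pattern: JSON.stringify({
          userIdentity: {
            type: ['Service'],
            principalId: ['dynamodb.amazonaws.com']
          }
        })
      }
    ]
  }
};

lambda.createEventSourceMapping(params, (err, data) => {
  if (err) console.error(err);
  else console.log('Created mapping:', data.UUID);
});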
Sadly, you can't make a DynamoDB stream emit only deletions or expirations of items. Everything is streamed, and it's up to your lambda function to filter for the events of interest.
For TTL-expired items, your function needs to check:
"userIdentity":{
"type":"Service",
"principalId":"dynamodb.amazonaws.com"
}
An alternative is to have a second table containing only TTL markers. This can be useful if your primary table experiences a lot of updates and modifications. This way, the stream on the second table would only invoke your function twice per item, i.e. on creation and on TTL expiration, rather than for all the updates you are not interested in.
I haven't found a way to trigger lambdas only on TTL expiration, but you can recognize TTL deletions from the event record payload you receive by checking the eventName and userIdentity properties, as said above:
Records: [
  {
    eventID: '36df5f15e7429cc986999f68349e6fef',
    eventName: 'REMOVE',
    eventVersion: '1.1',
    eventSource: 'aws:dynamodb',
    awsRegion: 'us-west-2',
    dynamodb: [Object],
    userIdentity: [Object],
    eventSourceARN: 'arn:aws:dynamodb:us-west-2:XXXXX:table/table-with-ttl/stream/2021-09-27T17:03:42.812'
  }
]
// unmarshall comes from @aws-sdk/util-dynamodb
if (record.eventName === 'REMOVE') {
  // Check if the deletion was done manually or triggered by the TTL timer.
  // https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/time-to-live-ttl-streams.html
  if (
    record.userIdentity &&
    record.userIdentity.type === 'Service' &&
    record.userIdentity.principalId === 'dynamodb.amazonaws.com'
  ) {
    const itemThatWasRemoved = unmarshall(record.dynamodb.OldImage)
    // Your code that only runs for TTL removals here
  }
}

How to identify the first execution of a Lambda version at runtime?

I want to run some code only on the first execution of a Lambda version. (NB: I'm not referring to a cold start scenario as this will occur more than once).
computeInstanceInvocationCount is unfortunately reset to 0 upon every cold start.
functionVersion is an available property but unless I store this in memory outside the lambda I cannot calculate if it is indeed the first execution.
Is it possible to deduce this based on runtime values in event or context? Or is there any other way?
There is no way of knowing that this is the first execution of a Lambda version from the information passed into the Lambda.
You would have to check elsewhere by setting a flag or parameter there. Remember, though, that multiple copies of the Lambda can be invoked at the same time, so any data store used for this would need to be transactional (see the sketch below) to ensure the first-run logic happens only once.
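A minimal sketch of such a transactional "run once" flag, using a DynamoDB conditional write (the first-run-flags table and its key schema are hypothetical, not something the answer prescribes):
const AWS = require('aws-sdk');
const ddb = new AWS.DynamoDB.DocumentClient();

// Returns true only for the single invocation that wins the conditional write.
async function isFirstRunOfVersion(functionVersion) {
  try {
    await ddb.put({
      TableName: 'first-run-flags',            // hypothetical table with partition key "version"
      Item: { version: functionVersion },
      ConditionExpression: 'attribute_not_exists(version)'
    }).promise();
    return true; // this invocation created the flag, so it is the first one
  } catch (err) {
    if (err.code === 'ConditionalCheckFailedException') return false; // another copy got there first
    throw err;
  }
}

exports.handler = async (event, context) => {
  if (await isFirstRunOfVersion(context.functionVersion)) {
    console.log('First execution of this version, running init');
    // one-time init here
  }
  // normal handler logic here
};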
One way you can try is to use AWS Secrets Manager (Parameter Store would work similarly).
On every deployment, update the secret value with
{"version":"latest","is_firsttime":true}
by running the command below after deployment:
aws secretsmanager update-secret --secret-id demo --secret-string '{"version":"latest","is_firsttime":true}'
This is something you need to make sure happens as part of every deployment.
Now we can add logic inside the lambda; in this demo we will look at is_firsttime only.
var AWS = require('aws-sdk'),
    region = "us-west-2",
    secretName = "demo",
    secret;

var client = new AWS.SecretsManager({
    region: region
});

client.getSecretValue({SecretId: secretName}, function(err, data) {
    if (err) return console.log(err, err.stack);
    secret = JSON.parse(data.SecretString);
    if (secret.is_firsttime == true) {
        console.log("lambda is running first time")
        // any init operation here
        // init completed, now we are good to set `is_firsttime` back to false
        var params = {
            Description: "Init completed, updating value",
            SecretId: "demo",
            SecretString: '{"version":"latest","is_firsttime":false}'
        };
        client.updateSecret(params, function(err, data) {
            if (err) console.log(err, err.stack); // an error occurred
            else console.log(data);               // successful response
        });
    }
    else {
        console.log("init already completed");
        // rest of the logic for the non-first-time case
    }
});
This is just demo code written for a non-Lambda environment; adjust it accordingly.
Expected response the first time:
{
    ARN: 'arn:aws:secretsmanager:us-west-2:12345:secret:demo-0Nlyli',
    Name: 'demo',
    VersionId: '3ae6623a-1111-4a41-88e5-12345'
}
Second time
init already completed

Google Cloud Datastore query values in array

I have entities that look like that:
{
  name: "Max",
  nicknames: [
    "bestuser"
  ]
}
How can I query by nickname to get the name?
I have created the following index,
indexes:
- kind: users
properties:
- name: name
- name: nicknames
I use the node.js client library to query the nickname,
db.createQuery('default','users').filter('nicknames', '=', 'bestuser')
the response is only an empty array.
Is there a way to do that?
You need to actually run the query against Datastore, not just create it. I'm not familiar with the Node.js library, but this is the code given on the Google Cloud website:
datastore.runQuery(query).then(results => {
// Task entities found.
const tasks = results[0];
console.log('Tasks:');
tasks.forEach(task => console.log(task));
});
where query would be
const query = db.createQuery('default','users').filter('nicknames', '=', 'bestuser')
Check the documentation at https://cloud.google.com/datastore/docs/concepts/queries#datastore-datastore-run-query-nodejs
The first point to notice is that you don't need to create an index for this kind of search. There are no inequalities, no orderings and no projections, so it is unnecessary.
As Reuben mentioned, you've created the query but you didn't run it.
ds.runQuery(query, (err, entities, info) => {
  if (err) {
    reject(err);
  } else {
    response.resultStatus = info.moreResults;
    // 'NO_MORE_RESULTS' means there is no further page to fetch
    response.cursor = info.moreResults == 'NO_MORE_RESULTS' ? null : info.endCursor;
    resolve(entities);
  }
});
In my case, the response structure was built to collect information about the cursor (in case there is more data than I queried for, because I limited the query size using limit), but you don't need anything more than the resolve(entities).
If you are using the default namespace you need to remove it from your query. Your query needs to be like this:
const query = db.createQuery('users').filter('nicknames', '=', 'bestuser')
I read the entire blob as a string to get the bytes of a binary file here; I imagine you can simply parse the JSON per your requirement.

In AWS SQS, Attribute is visible in console and not able to read it programatically

I am trying to insert a token into SQS using "AWS Service Proxy" in Web API, with the Path Override given below:
Account#/QueueName?Action=SendMessage&MessageAttribute.1.Name="TEST"&MessageAttribute.1.Value.StringValue="abcd"&MessageAttribute.1.Value.DataType=String&Version=2012-11-05&Expires=2100-05-05T22%3A52%3A43PST
I can see the attribute "TEST" created with value "abcd" in the SQS console, but when I try to retrieve it through code, I am not able to get the attribute "TEST". The code I am using to retrieve it is given below:
sqs.receiveMessage({
    QueueUrl: sqsQueueUrl,
    MaxNumberOfMessages: 3,  // how many messages do we want to retrieve?
    VisibilityTimeout: 60,   // seconds - how long we want a lock on this job
    WaitTimeSeconds: 3,      // seconds - how long should we wait for a message?
    AttributeNames: ["All"]
}, function (err, data) {
    // If there are any messages to get
    if (data.Messages) {
        // Get the first message
        var message = data.Messages[0],
            body = JSON.parse(message.Body);
        // Now this is where you'd do something with this message
        context.done("Message Found");
    }
});
Please let me know what I am missing. Thanks in advance.
Nowhere in your code are you attempting to read the value of the attribute; you are only accessing message.Body. Custom message attributes are returned in message.MessageAttributes, and you also need to request them with MessageAttributeNames (AttributeNames only covers system attributes such as SentTimestamp).
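A small sketch of what that could look like (the queue URL is a placeholder; the TEST attribute name comes from the question):
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

const sqsQueueUrl = 'https://sqs.us-east-1.amazonaws.com/123456789012/MyQueue'; // placeholder

sqs.receiveMessage({
  QueueUrl: sqsQueueUrl,
  MaxNumberOfMessages: 3,
  VisibilityTimeout: 60,
  WaitTimeSeconds: 3,
  MessageAttributeNames: ['All']   // or ['TEST'] to fetch just that attribute
}, function (err, data) {
  if (err) return console.error(err);
  (data.Messages || []).forEach(function (message) {
    // Custom attributes arrive under MessageAttributes, keyed by attribute name.
    const test = message.MessageAttributes && message.MessageAttributes.TEST;
    if (test) {
      console.log('TEST =', test.StringValue);
    }
  });
});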