Writing a single file to multiple s3 buckets with gulp-awspublish - amazon-web-services

I have a simple single-page app that is deployed to an S3 bucket using gulp-awspublish. We use inquirer.js (via gulp-prompt) to ask the developer which bucket to deploy to.
Sometimes the app may be deployed to several S3 buckets. Currently, we only allow one bucket to be selected, so the developer has to run gulp deploy for each bucket in turn. This is dull and prone to error.
I'd like to be able to select multiple buckets and deploy the same content to each. It's simple to select multiple buckets with inquirer.js/gulp-prompt, but not simple to generate arbitrary multiple S3 destinations from a single stream.
Our deploy task is based upon generator-webapp's S3 recipe. The recipe suggests gulp-rename to rewrite the path to write to a specific bucket. Currently our task looks like this:
gulp.task('deploy', ['build'], () => {
  // get AWS creds
  if (typeof(config.awsCreds) !== 'object') {
    return console.error('No config.awsCreds settings found. See README');
  }
  var dirname;
  const publisher = $.awspublish.create({
    key: config.awsCreds.key,
    secret: config.awsCreds.secret,
    bucket: config.awsCreds.bucket
  });
  return gulp.src('dist/**/*.*')
    .pipe($.prompt.prompt({
      type: 'list',
      name: 'dirname',
      message: 'Using the ‘' + config.awsCreds.bucket + '’ bucket. Which hostname would you like to deploy to?',
      choices: config.awsCreds.dirnames,
      default: config.awsCreds.dirnames.indexOf(config.awsCreds.dirname)
    }, function (res) {
      dirname = res.dirname;
    }))
    .pipe($.rename(function(path) {
      path.dirname = dirname + '/dist/' + path.dirname;
    }))
    .pipe(publisher.publish())
    .pipe(publisher.cache())
    .pipe($.awspublish.reporter());
});
It's hopefully obvious, but config.awsCreds might look something like:
awsCreds: {
  dirname: 'default-bucket',
  dirnames: ['default-bucket', 'other-bucket', 'another-bucket']
}
Gulp-rename rewrites the destination path to use the correct bucket.
We can select multiple buckets by using "checkbox" instead of "list" for the gulp-prompt options, but I'm not sure how to then deliver it to multiple buckets.
In a nutshell, if $.prompt returns an array of strings instead of a string, how can I write the source to multiple destinations (buckets) instead of a single bucket?
Please keep in mind that gulp.dest() is not used -- only gulp.awspublish() -- and we don't know how many buckets might be selected.

Never used S3, but if I understand your question correctly a file js/foo.js should be renamed to default-bucket/dist/js/foo.js and other-bucket/dist/js/foo.js when the checkboxes default-bucket and other-bucket are selected?
Then this should do the trick:
// additionally required modules
var path = require('path');
var through = require('through2').obj;
gulp.task('deploy', ['build'], () => {
  if (typeof(config.awsCreds) !== 'object') {
    return console.error('No config.awsCreds settings found. See README');
  }
  var dirnames = []; // array for selected buckets
  const publisher = $.awspublish.create({
    key: config.awsCreds.key,
    secret: config.awsCreds.secret,
    bucket: config.awsCreds.bucket
  });
  return gulp.src('dist/**/*.*')
    .pipe($.prompt.prompt({
      type: 'checkbox', // use checkbox instead of list
      name: 'dirnames', // use different result name
      message: 'Using the ‘' + config.awsCreds.bucket +
               '’ bucket. Which hostname would you like to deploy to?',
      choices: config.awsCreds.dirnames,
      default: [config.awsCreds.dirname] // checkbox prompts expect an array of pre-checked values
    }, function (res) {
      dirnames = res.dirnames; // store array of selected buckets
    }))
    // use through2 instead of gulp-rename
    .pipe(through(function(file, enc, done) {
      dirnames.forEach((dirname) => {
        var f = file.clone();
        f.path = path.join(f.base, dirname, 'dist',
                           path.relative(f.base, f.path));
        this.push(f);
      });
      done();
    }))
    .pipe(publisher.publish()) // unchanged from the question: uploads every clone
    .pipe(publisher.cache())
    .pipe($.awspublish.reporter());
});
Notice the comments where I made changes from the code you posted.
What this does is use through2 to clone each file passing through the stream. Each file is cloned as many times as there were bucket checkboxes selected and each clone is renamed to end up in a different bucket.
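The renaming step can be looked at in isolation. The sketch below is not part of the gulp task; fanOutPaths is a hypothetical helper name, and it uses path.posix so the result is the same on every platform. It only shows which destination paths the clones of one file end up with.

```javascript
const path = require('path');

// Hypothetical helper: one destination path per selected dirname,
// mirroring what the through2 step does to each cloned file.
function fanOutPaths(relativePath, dirnames) {
  return dirnames.map((dirname) => path.posix.join(dirname, 'dist', relativePath));
}

const result = fanOutPaths('js/foo.js', ['default-bucket', 'other-bucket']);
console.log(result);
// [ 'default-bucket/dist/js/foo.js', 'other-bucket/dist/js/foo.js' ]
```

If no checkbox is selected, the array is empty and no clone is pushed, so nothing is published.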

Related

I am learning to create AWS Lambdas. I want to create a "chain": S3 -> 4 Chained Lambda()'s -> RDS. I can't get the first lambda to call the second

I really tried everything. Surprisingly google has not many answers when it comes to this.
When a certain .csv file is uploaded to a S3 bucket I want to parse it and place the data into a RDS database.
My goal is to learn the lambda serverless technology, this is essentially an exercise. Thus, I over-engineered the hell out of it.
Here is how it goes:
1. S3 Trigger when the .csv is uploaded -> call lambda (this part fully works)
2. AAA_Thomas_DailyOverframeS3CsvToAnalytics_DownloadCsv downloads the csv from S3 and finishes with essentially the plaintext of the file. It is then supposed to pass it to the next lambda. The way I am trying to do this is by putting the second lambda as destination. The function works, but the second lambda is never called and I don't know why.
3. AAA_Thomas_DailyOverframeS3CsvToAnalytics_ParseCsv gets the plaintext as input and returns a javascript object with the parsed data.
4. AAA_Thomas_DailyOverframeS3CsvToAnalytics_DecryptRDSPass only connects to KMS, gets the encrypted RDS password, and passes it along with the data it received as input to the last lambda.
5. AAA_Thomas_DailyOverframeS3CsvToAnalytics_PutDataInRds then finally puts the data in RDS.
I created a custom VPC with custom subnets, route tables, gateways, peering connections, etc. I don't know if this is relevant, but function 2 only has access to the S3 endpoint, function 3 does not have any internet access whatsoever, function 4 is the only one that has normal internet access (it's the only way to connect to KMS), and function 5 only has access to the peered VPC which hosts the RDS.
This is the code of the first lambda:
// dependencies
const AWS = require('aws-sdk');
const util = require('util');
const s3 = new AWS.S3();
let region = process.env;
exports.handler = async (event, context, callback) =>
{
    var checkDates = process.env.CheckDates == "false" ? false : true;
    var ret = [];
    var checkFileDate = function(actualFileName)
    {
        if (!checkDates)
            return true;
        var d = new Date();
        var expectedFileName = 'Overframe_-_Analytics_by_Day_Device_' + d.getUTCFullYear() + '-' + (d.getUTCMonth().toString().length == 1 ? "0" + d.getUTCMonth() : d.getUTCMonth()) + '-' + (d.getUTCDate().toString().length == 1 ? "0" + d.getUTCDate() : d.getUTCDate());
        return expectedFileName == actualFileName.substr(0, expectedFileName.length);
    };
    for (var i = 0; i < event.Records.length; ++i)
    {
        var record = event.Records[i];
        try {
            if (record.s3.bucket.name != process.env.S3BucketName)
            {
                console.error('Unexpected notification, unknown bucket: ' + record.s3.bucket.name);
                continue;
            }
            if (!checkFileDate(record.s3.object.key))
            {
                console.error('Unexpected file, or date is not today\'s: ' + record.s3.object.key);
                continue;
            }
            const params = {
                Bucket: record.s3.bucket.name,
                Key: record.s3.object.key
            };
            var csvFile = await s3.getObject(params).promise();
            var allText = csvFile.Body.toString('utf-8');
            console.log('Loaded data:', {Bucket: params.Bucket, Filename: params.Key, Text: allText});
            ret.push(allText);
        } catch (error) {
            console.log("Couldn't download CSV from S3", error);
            return { statusCode: 500, body: error };
        }
    }
    // I've been randomly trying different ways to return the data, none works. The data itself is correct , I checked with console.log()
    const response = {
        statusCode: 200,
        body: { "Records": ret }
    };
    return ret;
};
[Screenshot: how the first lambda was set up, especially its destination]
I haven't posted on Stackoverflow in 7 years. That's how desperate I am. Thanks for the help.
Rather than getting each Lambda to call the next one, take a look at AWS Step Functions, the AWS managed service for state machines, which can handle this workflow for you.
By defining inputs and outputs, you can pass the output of one function to the next, with retry logic built in.
If you don't have much experience, AWS has a tutorial on setting up a state machine that chains Lambdas.
With this approach you also do not need to account for configuration issues such as Lambda timeouts. It also makes your code more modular, which makes each piece easier to test and issues easier to isolate.
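As a rough illustration of such a chain, the sketch below builds a state machine definition in Amazon States Language as a plain JavaScript object. The region, account ID and shortened function names in the ARNs are placeholders, and the retry settings are arbitrary example values, not recommendations.

```javascript
// Placeholder ARN builder: region, account ID and names are hypothetical.
const arn = (name) => `arn:aws:lambda:eu-west-1:123456789012:function:${name}`;

// The four steps from the question, in the order they should run.
const steps = ['DownloadCsv', 'ParseCsv', 'DecryptRDSPass', 'PutDataInRds'];

// Build one Task state per step; each state's output becomes the next state's input.
function buildChain(names) {
  const States = {};
  names.forEach((name, i) => {
    const state = {
      Type: 'Task',
      Resource: arn(name),
      // Example retry policy for failed tasks (values are illustrative).
      Retry: [{ ErrorEquals: ['States.TaskFailed'], IntervalSeconds: 2, MaxAttempts: 3, BackoffRate: 2 }]
    };
    if (i < names.length - 1) {
      state.Next = names[i + 1];
    } else {
      state.End = true;
    }
    States[name] = state;
  });
  return { StartAt: names[0], States };
}

const definition = buildChain(steps);
console.log(JSON.stringify(definition, null, 2));
```

The printed JSON is the kind of document that can be supplied as a state machine definition; the S3 trigger would then start an execution instead of invoking the first Lambda's destination.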
The execution roles of all Lambda functions, whose destinations include other Lambda functions, must have the lambda:InvokeFunction IAM permission in one of their attached IAM policies.
Here's a snippet from Lambda documentation:
To send events to a destination, your function needs additional permissions. Add a policy with the required permissions to your function's execution role. Each destination service requires a different permission, as follows:
Amazon SQS – sqs:SendMessage
Amazon SNS – sns:Publish
Lambda – lambda:InvokeFunction
EventBridge – events:PutEvents
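For the Lambda-to-Lambda case, the policy attached to the first function's execution role could look roughly like the document built below. This is a sketch: the target ARN is a placeholder, and invokePolicy is a hypothetical helper name.

```javascript
// Hypothetical helper: IAM policy document allowing invocation of one target function.
function invokePolicy(targetArn) {
  return {
    Version: '2012-10-17',
    Statement: [
      {
        Effect: 'Allow',
        Action: 'lambda:InvokeFunction',
        Resource: targetArn
      }
    ]
  };
}

// Placeholder ARN for the second Lambda in the chain.
const policy = invokePolicy('arn:aws:lambda:eu-west-1:123456789012:function:ParseCsv');
console.log(JSON.stringify(policy, null, 2));
```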

Fetching multiple objects with same prefix from s3 bucket

I have multiple folders in an S3 bucket, and each folder contains some .txt files. Now I want to fetch just 10 .txt files from a given folder using the JavaScript API.
For example, the path is something like this:
s3bucket/folder1/folder2/folder3/id
Now the id folder is the one containing multiple .txt files. There are multiple id folders inside folder3. I want to pass an id and get 10 S3 objects which have that id as a prefix. Is this possible using listObjectsV2? How do I limit the response to just 10 objects?
s3bucket/folder1/folder2/folder3/
|-- id1/
|   |-- obj1.txt
|   |-- obj2.txt
|-- id2/
|   |-- obj3.txt
|   |-- obj4.txt
|-- id3/
    |-- obj5.txt
    |-- obj6.txt
So if I pass
var params= {Bucket:"s3bucket",Key:"folder1/folder2/folder3/id1"}
I should get obj1.txt and obj2.txt in response.
Which S3 method are you using? I suggest using listObjectsV2 to achieve your goal. A possible call might look like the following:
const s3 = new AWS.S3();
const { Contents } = await s3.listObjectsV2({
  Bucket: 's3bucket',
  Prefix: 'folder1/folder2/folder3/id1',
  MaxKeys: 10
}).promise();
To get the object contents, you need to call getObject on each Key:
const responses = await Promise.all((Contents || []).map(({ Key }) => (
  s3.getObject({
    Bucket: 's3bucket',
    Key
  }).promise()
)));
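Two details may matter in practice: a prefix without a trailing slash also matches sibling "folders" such as id10, and MaxKeys counts every key under the prefix, not only .txt files. The sketch below handles both by normalising the prefix and paging until enough .txt keys are found. listTxtKeys and makeFakeS3 are hypothetical names, and the in-memory client is only a stand-in so the example runs without AWS; a real AWS SDK v2 S3 client can be passed instead.

```javascript
// Sketch: collect up to `limit` .txt keys under a "folder" prefix.
// `s3` is any object exposing listObjectsV2(params).promise().
async function listTxtKeys(s3, bucket, folder, limit = 10) {
  // Trailing slash so that "id1" does not also match "id10".
  const Prefix = folder.endsWith('/') ? folder : folder + '/';
  const keys = [];
  let token;
  do {
    const params = { Bucket: bucket, Prefix, MaxKeys: limit };
    if (token) params.ContinuationToken = token;
    const page = await s3.listObjectsV2(params).promise();
    for (const { Key } of page.Contents || []) {
      if (Key.endsWith('.txt') && keys.length < limit) keys.push(Key);
    }
    token = page.IsTruncated ? page.NextContinuationToken : undefined;
  } while (token && keys.length < limit);
  return keys;
}

// In-memory stand-in for the S3 client (not the real SDK).
function makeFakeS3(allKeys) {
  return {
    listObjectsV2({ Prefix, MaxKeys, ContinuationToken }) {
      const matching = allKeys.filter((k) => k.startsWith(Prefix)).sort();
      const start = ContinuationToken ? Number(ContinuationToken) : 0;
      const slice = matching.slice(start, start + MaxKeys);
      const end = start + slice.length;
      const IsTruncated = end < matching.length;
      return {
        promise: async () => ({
          Contents: slice.map((Key) => ({ Key })),
          IsTruncated,
          NextContinuationToken: IsTruncated ? String(end) : undefined
        })
      };
    }
  };
}

const fake = makeFakeS3([
  'folder1/folder2/folder3/id1/obj1.txt',
  'folder1/folder2/folder3/id1/obj2.txt',
  'folder1/folder2/folder3/id10/other.txt',
  'folder1/folder2/folder3/id2/obj3.txt'
]);
listTxtKeys(fake, 's3bucket', 'folder1/folder2/folder3/id1').then((keys) => console.log(keys));
```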

s3.putObject(params).promise() does not upload file, but successfully executes then() callback

I have made many attempts to put a file in an S3 bucket, after which I have to update my model.
I have the following code (note that I have tried the commented lines too; it works neither with them nor without them).
The problem observed:
Everything in the first .then() block (successCallBack()) gets successfully executed, but I do not see the result of s3.putObject().
The bucket in question is public, with no access restrictions. It used to work with the sls offline option. Then, because it was not working on AWS, I had to make a lot of changes and managed to make successCallBack() work, which does the database work successfully. However, the file upload still doesn't work.
Some questions:
While solving this, the real questions I am pondering and searching for answers to are:
Is a lambda supposed to return something? I looked at the AWS docs, but they have fragmented code snippets.
Putting await in front of s3.putObject(params).promise() does not help. I see samples with and without await in front of calls that use the AWS promise() function. I am not sure which ones are correct.
What is the correct way to chain async functions within one lambda function?
UPDATE:
var myJSON = {}
const createBook = async (event) => {
  let bucketPath = "https://com.xxxx.yyyy.aa-bb-zzzzzz-1.amazonaws.com"
  let fileKey = //file key
  let path = bucketPath + "/" + fileKey;
  myJSON = {
    //JSON from headers
  }
  var s3 = new AWS.S3();
  let buffer = Buffer.from(event.body, 'utf8');
  var params = {Bucket: 'com.xxxx.yyyy', Key: fileKey, Body: buffer, ContentEncoding: 'utf8'};
  let putObjPromise = s3.putObject(params).promise();
  putObjPromise
    .then(successCallBack())
    .then(c => {
      console.log('File upload Success!');
      return {
        statusCode: 200,
        headers: { 'Content-Type': 'text/plain' },
        body: "Success!!!"
      }
    })
    .catch(err => {
      let str = "File upload / Creation error:" + err;
      console.log(str);
      return {
        statusCode: err.statusCode || 500,
        headers: { 'Content-Type': 'text/plain' },
        body: str
      }
    });
}
const successCallBack = async () => {
  console.log("Inside success callback - " + JSON.stringify(myJSON)) ;
  const { myModel } = await connectToDatabase()
  console.log("After connectToDatabase")
  const book = await myModel.create(myJSON)
  console.log(msg);
}
Finally, I got this to work. My code worked already in sls offline setup.
What was different on AWS endpoint?
What I observed on console was the fact that my lambda was set to run under VPC.
When I chose No VPC, it worked. I do not know if this is the best practice. There must be some security advantage obtained by functions running under VPC.
I came across this huge explanation about VPC but I could not find anything related to S3.
The code posted in the question currently runs fine on AWS endpoint.
If the Lambda is running in a VPC, it needs a VPC endpoint (or a NAT gateway) to reach a service outside the VPC, and S3 is outside the VPC. If security is a concern, creating a VPC endpoint would be a better fix than taking the function out of the VPC. Likewise, if you attach a policy granting S3 access (or the managed AmazonS3FullAccess policy) to the role that the Lambda uses, the S3 bucket would not need to be public.
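Separately from the VPC issue, two details of the handler in the question are worth noting: .then(successCallBack()) calls the function immediately instead of passing it as a callback, and the promise chain is neither returned nor awaited, so the async handler resolves before the upload has finished. Below is a minimal sketch of the same flow with every step awaited. The deps object (s3, saveBook, bucket, fileKey) is a hypothetical arrangement chosen so the flow can be run without AWS; it is not taken from the original code.

```javascript
// Sketch: same flow as createBook in the question, with each async step awaited.
// `deps` is a hypothetical bundle of collaborators, injected so this runs without AWS.
async function createBook(event, deps) {
  const { s3, saveBook, bucket, fileKey } = deps;
  try {
    // 1. Upload first and wait for the upload to finish.
    await s3.putObject({
      Bucket: bucket,
      Key: fileKey,
      Body: Buffer.from(event.body, 'utf8')
    }).promise();
    // 2. Only then update the model.
    await saveBook();
    // 3. Return the response from the handler itself.
    return { statusCode: 200, headers: { 'Content-Type': 'text/plain' }, body: 'Success!!!' };
  } catch (err) {
    return {
      statusCode: err.statusCode || 500,
      headers: { 'Content-Type': 'text/plain' },
      body: 'File upload / Creation error:' + err
    };
  }
}

// Stand-in collaborators that record the order of calls.
const calls = [];
const fakeDeps = {
  bucket: 'example-bucket',
  fileKey: 'example-key',
  s3: { putObject: (params) => ({ promise: async () => { calls.push('put:' + params.Key); } }) },
  saveBook: async () => { calls.push('save'); }
};

createBook({ body: 'hello' }, fakeDeps).then((res) => {
  console.log(res.statusCode, calls);
});
```

With this shape, an async Lambda handler returns its response object directly, and await is needed in front of every .promise() call whose completion the rest of the function depends on.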

delete folder from s3 nodejs

Hey guys, I was trying to delete a folder from S3 with stuff in it, but deleteObjects wasn't working. I found this script online and it works great. My question is: why does it work? Why do you have to call listObjects when deleting a folder on S3? Why can't I just pass it the folder's name? And why doesn't it error when I attempt to delete the folder without listing the objects first?
First attempt (doesn't work):
var filePath2 = "templates/" + key + "/test/";
var toPush = { Key: filePath2 };
deleteParams.Delete.Objects.push(toPush);
console.log("deleteParams", deleteParams);
console.log("deleteParams.Delete", deleteParams.Delete);
const deleteResult = await s3.deleteObjects(deleteParams).promise();
console.log("deleteResult", deleteResult);
Keep in mind that filePath2 is a folder that has other stuff in it. I get no error, the catch isn't triggered, and the result says Deleted followed by the folder name.
Second attempt (works):
async function deleteFromS3(bucket, path) {
  const listParams = {
    Bucket: bucket,
    Prefix: path
  };
  const listedObjects = await s3.listObjectsV2(listParams).promise();
  console.log("listedObjects", listedObjects);
  if (listedObjects.Contents.length === 0) return;
  const deleteParams = {
    Bucket: bucket,
    Delete: { Objects: [] }
  };
  listedObjects.Contents.forEach(({ Key }) => {
    deleteParams.Delete.Objects.push({ Key });
  });
  console.log("deleteParams", deleteParams);
  const deleteResult = await s3.deleteObjects(deleteParams).promise();
  console.log("deleteResult", deleteResult);
  if (listedObjects.IsTruncated && deleteResult)
    await deleteFromS3(bucket, path);
}
Then I call the function like so:
const result = await deleteFromS3(myBucketName, folderPath);
Folders do not exist in Amazon S3. It is a flat object storage system, where the filename (Key) for each object contains the full path.
While Amazon S3 does support the concept of a Common Prefix, which can make things appear as though they are in folders/directories, folders do not actually exist.
For example, you could run a command like this:
aws s3 cp foo.txt s3://my-bucket/folder1/folder2/foo.txt
This would work even if the folders do not exist! It is merely storing an object with a Key of folder1/folder2/foo.txt.
If you were then to delete that object, the 'folder' would disappear because no object has it as a path. That is because the folder never actually existed.
Sometimes people want an empty folder to appear, so they create a zero-length object with the same name as the folder, eg folder1/folder2/.
So, your first program did not work because it deleted the 'folder', which has nothing to do with deleting the content of the folder (since there is no concept of 'content' of a folder).
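The flat keyspace can be illustrated with a toy model. The sketch below is not the S3 API: the bucket is just a Map from full key to content, and deleteKeys and deletePrefix are hypothetical names. It shows why deleting the 'folder' key leaves the other objects in place, and why deleting a key that does not exist raises no error.

```javascript
// Toy model of a bucket: a flat map from full key to content (not the S3 API).
const bucket = new Map([
  ['folder1/folder2/', ''],              // zero-length "folder" placeholder
  ['folder1/folder2/foo.txt', 'foo'],
  ['folder1/folder2/bar.txt', 'bar']
]);

// Removes exactly the keys given. A missing key is silently ignored,
// which matches S3 reporting success when deleting a non-existent key.
function deleteKeys(keys) {
  keys.forEach((key) => bucket.delete(key));
}

// "Deleting a folder" means listing every key with the prefix, then deleting those keys.
function deletePrefix(prefix) {
  deleteKeys([...bucket.keys()].filter((key) => key.startsWith(prefix)));
}

deleteKeys(['folder1/folder2/']);
const afterFolderDelete = bucket.size;
console.log(afterFolderDelete); // 2: both .txt objects are still there

deletePrefix('folder1/folder2/');
const afterPrefixDelete = bucket.size;
console.log(afterPrefixDelete); // 0
```

The second, working script in the question follows the deletePrefix pattern: list by Prefix, then pass the listed keys to deleteObjects.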

Copying one table to another in DynamoDB

What's the best way to identically copy one table over to a new one in DynamoDB?
(I'm not worried about atomicity).
Create a backup (using the Backups option) and restore it under a new table name. That gets all the data into the new table.
Note: this takes a considerable amount of time, depending on the table size.
I just used the python script, dynamodb-copy-table, making sure my credentials were in some environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY), and it worked flawlessly. It even created the destination table for me.
python dynamodb-copy-table.py src_table dst_table
The default region is us-west-2; change it with the AWS_DEFAULT_REGION environment variable.
AWS Data Pipeline provides a template which can be used for this purpose: "CrossRegion DynamoDB Copy"
See: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-crossregion-ddb-create.html
The result is a simple pipeline. [Figure: the resulting pipeline]
Although it's called CrossRegion, you can easily use it within the same region as long as the destination table name is different (remember that table names are unique per account and region).
You can use Scan to read the data and save it to the new table.
On the AWS forums a guy from the AWS team posted another approach using EMR: How Do I Duplicate a Table?
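Here's one solution to copy all items from one table to another, just using shell scripting, the AWS CLI and jq. It works OK for small tables; note that batch-write-item accepts at most 25 put requests per call, so anything larger needs the payload split into chunks.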
# exit on error
set -eo pipefail
# tables
TABLE_FROM=<table>
TABLE_TO=<table>
# read
aws dynamodb scan \
--table-name "$TABLE_FROM" \
--output json \
| jq "{ \"$TABLE_TO\": [ .Items[] | { PutRequest: { Item: . } } ] }" \
> "$TABLE_TO-payload.json"
# write
aws dynamodb batch-write-item --request-items file://"$TABLE_TO-payload.json"
# clean up
rm "$TABLE_TO-payload.json"
If you want both tables to be identical, you'd want to delete all items in TABLE_TO first.
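Because BatchWriteItem accepts at most 25 put or delete requests per call, a payload built from a larger scan has to be split before it is sent. The sketch below shows one way to do the split; toBatchPayloads is a hypothetical helper name and the sample items are made up. Each resulting object has the same shape as the request-items JSON produced by the jq step above.

```javascript
// Split scanned items into BatchWriteItem payloads of at most `size` requests each.
function toBatchPayloads(tableTo, items, size = 25) {
  const payloads = [];
  for (let i = 0; i < items.length; i += size) {
    payloads.push({
      [tableTo]: items.slice(i, i + size).map((Item) => ({ PutRequest: { Item } }))
    });
  }
  return payloads;
}

// Made-up sample: 60 items in DynamoDB attribute-value format.
const items = Array.from({ length: 60 }, (_, n) => ({ id: { S: String(n) } }));
const payloads = toBatchPayloads('dst_table', items);
console.log(payloads.map((p) => p.dst_table.length)); // [ 25, 25, 10 ]
```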
DynamoDB now supports importing from S3.
https://aws.amazon.com/blogs/database/amazon-dynamodb-can-now-import-amazon-s3-data-into-a-new-table/
So, probably in almost all use cases, the easiest and cheapest way to replicate a table is:
1. Use the "Export to S3" feature to dump the entire table into S3. Since this uses a backup to generate the dump, the table's throughput is not affected, and it is very fast as well. You need to have backups (PITR) enabled. See https://aws.amazon.com/blogs/aws/new-export-amazon-dynamodb-table-data-to-data-lake-amazon-s3/
2. Use "Import from S3" to import the dump created in step 1. The import always creates a new table.
Use this Node.js module: copy-dynamodb-table
This is a little script I made to copy the contents of one table to another.
It's based on the AWS SDK v3. I'm not sure how well it would scale to big tables, but as a quick and dirty solution it does the job.
It gets your AWS credentials from a profile in ~/.aws/credentials; change default to the name of the profile you want to use.
Other than that, it takes two args: one for the source table and one for the destination.
const { fromIni } = require("@aws-sdk/credential-providers");
const { DynamoDBClient, ScanCommand, PutItemCommand } = require("@aws-sdk/client-dynamodb");
const ddbClient = new DynamoDBClient({
  credentials: fromIni({ profile: "default" }),
  region: "eu-west-1",
});
// args[0] is the source table, args[1] the destination table
const args = process.argv.slice(2);
console.log(args);
async function main() {
  let copied = 0;
  let ExclusiveStartKey;
  // A single Scan call returns at most 1 MB of data, so keep scanning
  // until no LastEvaluatedKey comes back
  do {
    const { Items = [], LastEvaluatedKey } = await ddbClient.send(
      new ScanCommand({ TableName: args[0], ExclusiveStartKey })
    );
    console.log("Copying", Items.length, "Items");
    // Write the current page before fetching the next one
    await Promise.all(
      Items.map((item) =>
        ddbClient.send(new PutItemCommand({ TableName: args[1], Item: item }))
      )
    );
    copied += Items.length;
    ExclusiveStartKey = LastEvaluatedKey;
  } while (ExclusiveStartKey);
  console.log("Successfully copied", copied, "items");
}
main().catch((err) => {
  console.error(err);
  process.exit(1);
});
Usage
node copy-table.js <source_table_name> <destination_table_name>
Python + boto3 🚀
The script is idempotent as long as you maintain the same keys.
import boto3
def migrate(source, target):
    dynamo_client = boto3.client('dynamodb', region_name='us-east-1')
    dynamo_target_client = boto3.client('dynamodb', region_name='us-west-2')
    dynamo_paginator = dynamo_client.get_paginator('scan')
    dynamo_response = dynamo_paginator.paginate(
        TableName=source,
        Select='ALL_ATTRIBUTES',
        ReturnConsumedCapacity='NONE',
        ConsistentRead=True
    )
    for page in dynamo_response:
        for item in page['Items']:
            dynamo_target_client.put_item(
                TableName=target,
                Item=item
            )
if __name__ == '__main__':
    migrate('awesome-v1', 'awesome-v2')
On November 29th, 2017 Global Tables was introduced. This may be useful depending on your use case, which may not be the same as the original question. Here are a few snippets from the blog post:
Global Tables – You can now create tables that are automatically replicated across two or more AWS Regions, with full support for multi-master writes, with a couple of clicks. This gives you the ability to build fast, massively scaled applications for a global user base without having to manage the replication process.
...
You do not need to make any changes to your existing code. You simply send write requests and eventually consistent read requests to a DynamoDB endpoint in any of the designated Regions (writes that are associated with strongly consistent reads should share a common endpoint). Behind the scenes, DynamoDB implements multi-master writes and ensures that the last write to a particular item prevails. When you use Global Tables, each item will include a timestamp attribute representing the time of the most recent write. Updates are propagated to other Regions asynchronously via DynamoDB Streams and are typically complete within one second (you can track this using the new ReplicationLatency and PendingReplicationCount metrics).
Another option is to download the table as a .csv file and upload it with the following snippet of code.
This also eliminates the need to provide your AWS credentials to a package such as the one @ezzat suggests.
1. Create a new folder and add the following two files and your exported table.
2. Edit uploadToDynamoDB.js and add the filename of the exported table and your table name.
3. Run npm install in the folder.
4. Run node uploadToDynamoDB.js.
File: package.json
{
  "name": "uploadtodynamodb",
  "version": "1.0.0",
  "description": "",
  "main": "uploadToDynamoDB.js",
  "author": "",
  "license": "ISC",
  "dependencies": {
    "async": "^3.1.1",
    "aws-sdk": "^2.624.0",
    "csv-parse": "^4.8.5",
    "fs": "0.0.1-security",
    "lodash": "^4.17.15",
    "uuid": "^3.4.0"
  }
}
File: uploadToDynamoDB.js
var fs = require('fs');
var parse = require('csv-parse');
var async = require('async');
var _ = require('lodash');
var AWS = require('aws-sdk');
// If your table is in another region, make sure to update this
AWS.config.update({ region: "eu-central-1" });
var ddb = new AWS.DynamoDB({ apiVersion: '2012-08-10' });
var csv_filename = "./TABLE_CSV_EXPORT_FILENAME.csv";
var tableName = "TABLENAME";
// Convert one chunk of parsed CSV rows into batchWriteItem parameters
function prepareData(data_chunk) {
  const items = data_chunk.map(obj => {
    const keys = Object.keys(obj);
    const attr = Object.values(obj).map(a => {
      // Store values that parse as numbers as N, everything else as S.
      // Blank strings stay strings: Number('') is 0, but '' is not a valid N value.
      if (a.trim() === '' || isNaN(Number(a))) {
        return { "S": a };
      }
      return { "N": a };
    });
    return {
      PutRequest: {
        Item: _.zipObject(keys, attr)
      }
    };
  });
  return {
    RequestItems: {
      [tableName]: items
    }
  };
}
var rs = fs.createReadStream(csv_filename);
var parser = parse({
  columns: true,
  delimiter: ','
}, function(err, data) {
  if (err) {
    return console.log("Error parsing CSV", err);
  }
  // batchWriteItem accepts at most 25 put requests per call
  var split_arrays = [], size = 25;
  while (data.length > 0) {
    split_arrays.push(data.splice(0, size));
  }
  async.each(split_arrays, function(item_data, callback) {
    const params = prepareData(item_data);
    ddb.batchWriteItem(params, function(err, data) {
      if (err) {
        console.log("Error", err);
      } else {
        console.log("Success", data);
      }
      // Signal completion so the final callback below actually runs
      callback(err);
    });
  }, function(err) {
    // runs after all chunks have been written, or on the first error
    if (err) {
      return console.log('import failed', err);
    }
    console.log('all data imported....');
  });
});
rs.pipe(parser);
It's been a very long time since the question was posted, and AWS has been continuously improving its features. At the time of writing this answer, we have the option to export the table to an S3 bucket and then use the import feature to import this data from S3 into a new table, which is created automatically with the data. Please refer to this blog for more on export and import.
The best part is that you get to change the name, PK, or SK.
Note: You have to enable PITR (this might incur additional costs). It is always best to refer to the documentation.
Here is another simple python util script for this: ddb_table_copy.py. I use it often.
usage: ddb_table_copy.py [-h] [--dest-table DEST_TABLE] [--dest-file DEST_FILE] source_table
Copy all DynamoDB items from SOURCE_TABLE to either DEST_TABLE or DEST_FILE. Useful for migrating data during a stack teardown/re-creation.
positional arguments:
source_table Name of source table in DynamoDB.
optional arguments:
-h, --help show this help message and exit
--dest-table DEST_TABLE
Name of destination table in DynamoDB.
--dest-file DEST_FILE
2) a valid file path string to save the items to, e.g. 'items.json'.