Access to AWS Kinesis tables in DynamoDB using the AWS CLI - amazon-web-services

I am new to AWS Kinesis and trying to learn how to build a distributed application using AWS DynamoDB. Could someone tell me how to access the DynamoDB tables used by my AWS Kinesis streams using the AWS CLI? Can I query the tables?

Sure. Kinesis Streams uses an Amazon DynamoDB table for consumer offset management.
Your table will have the same name as your consumer application (defined via the KCL). You can see all consumer offset tables in the console (https://console.aws.amazon.com/dynamodb/home).
A consumer can run on multiple instances, but only one instance (the leaseOwner) can consume from a given partition (or shard, as Kinesis calls it). If that instance fails, another instance of the same consumer takes over and continues processing from the checkpoint.
The checkpoint is the last processed event.
The shard that a consumer instance is processing is identified by the leaseKey, which is unique within the table.
The data structure of the key-value document in DynamoDB looks like the one below,
where
"S" - string
"N" - number
{
"leaseOwner": {
"S": "SmartConsumerStream_Consumer-192.168.1.83"
},
"checkpoint": {
"S": "49570630332110756564477900867375857710984404992079691778"
},
"checkpointSubSequenceNumber": {
"N": "0"
},
"leaseCounter": {
"N": "16"
},
"leaseKey": {
"S": "shardId-000000000000"
},
"ownerSwitchesSinceCheckpoint": {
"N": "0"
}
}
You can use the DynamoDB API to get the current offset for a given partition key (leaseKey). You can only query by leaseKey because that is the indexed key in the table; it's created by the Kinesis stream itself.
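To answer the AWS CLI part of the question, the same lookup can be done directly from the CLI. A minimal sketch, assuming the lease table is named your_consumer_id as in the Java example below:
# list all tables; the KCL lease table has the same name as your consumer application
aws dynamodb list-tables

# fetch the lease/checkpoint item for a single shard (leaseKey is the hash key)
aws dynamodb get-item \
  --table-name your_consumer_id \
  --key '{"leaseKey": {"S": "shardId-000000000000"}}'

# or dump the checkpoints of every shard; lease tables are small, so a scan is fine
aws dynamodb scan --table-name your_consumer_id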
You can use the stream driver I'm writing here, which gives you an interface to get the consumer offset very easily.
Here are the kinesis-stream tests, which might be helpful too.
In summary, to get the consumer offset using the Java API (put your AWS credentials in ~/.aws/credentials):
public Map<String, String> getConsumerPosition() {
DynamoDB dynamoDB = new DynamoDB(getOffsetConnection()); // connects using the AWS profile credentials configured below
Table consumerOffsetTable = dynamoDB.getTable("your_consumer_id");
Map<String, Object> leaseOwner = consumerOffsetTable.getItem("leaseKey", "shardId-000000000000").asMap();
return new HashMap<String, String>(){{
put(leaseOwner.get("leaseKey").toString(), leaseOwner.get("checkpoint").toString());
}};
}
public AmazonDynamoDB getOffsetConnection() {
AmazonDynamoDBClient dynamoDB = new AmazonDynamoDBClient(getAuthProfileCredentials(), getHttpConfiguration());
return dynamoDB;
}
private ProfileCredentialsProvider getAuthProfileCredentials() {
return new ProfileCredentialsProvider("your-aws-auth-profile_in_~/.aws/credentials");
}
private ClientConfiguration getHttpConfiguration() {
ClientConfiguration clientConfiguration = new ClientConfiguration();
return clientConfiguration;
}
Hope it helps; let me know if I can help in other ways.

Related

I am learning to create AWS Lambdas. I want to create a "chain": S3 -> 4 chained Lambdas -> RDS. I can't get the first Lambda to call the second

I have really tried everything. Surprisingly, Google doesn't have many answers when it comes to this.
When a certain .csv file is uploaded to an S3 bucket, I want to parse it and place the data into an RDS database.
My goal is to learn the Lambda serverless technology; this is essentially an exercise. Thus, I over-engineered the hell out of it.
Here is how it goes:
S3 Trigger when the .csv is uploaded -> call lambda (this part fully works)
AAA_Thomas_DailyOverframeS3CsvToAnalytics_DownloadCsv downloads the CSV from S3 and finishes with essentially the plain text of the file. It is then supposed to pass it to the next Lambda. The way I am trying to do this is by setting the second Lambda as the destination. The function works, but the second Lambda is never called and I don't know why.
AAA_Thomas_DailyOverframeS3CsvToAnalytics_ParseCsv gets the plaintext as input and returns a javascript object with the parsed data.
AAA_Thomas_DailyOverframeS3CsvToAnalytics_DecryptRDSPass only connects to KMS, gets the encrypted RDS password, and passes it along with the data it received as input to the last Lambda.
AAA_Thomas_DailyOverframeS3CsvToAnalytics_PutDataInRds then finally puts the data in RDS.
I created a custom VPC with custom subnets, route tables, gateways, peering connections, etc. I don't know if this is relevant, but function 2. only has access to the S3 endpoint, 3. does not have any internet access whatsoever, 4. is the only one that has normal internet access (it's the only way to connect to KMS), and 5. only has access to the peered VPC which hosts the RDS.
This is the code of the first lambda:
// dependencies
const AWS = require('aws-sdk');
const util = require('util');
const s3 = new AWS.S3();
let region = process.env;
exports.handler = async (event, context, callback) =>
{
var checkDates = process.env.CheckDates == "false" ? false : true;
var ret = [];
var checkFileDate = function(actualFileName)
{
if (!checkDates)
return true;
var d = new Date();
var expectedFileName = 'Overframe_-_Analytics_by_Day_Device_' + d.getUTCFullYear() + '-' + ((d.getUTCMonth() + 1).toString().length == 1 ? "0" + (d.getUTCMonth() + 1) : (d.getUTCMonth() + 1)) + '-' + (d.getUTCDate().toString().length == 1 ? "0" + d.getUTCDate() : d.getUTCDate()); // getUTCMonth() is zero-based, hence the + 1
return expectedFileName == actualFileName.substr(0, expectedFileName.length);
};
for (var i = 0; i < event.Records.length; ++i)
{
var record = event.Records[i];
try {
if (record.s3.bucket.name != process.env.S3BucketName)
{
console.error('Unexpected notification, unknown bucket: ' + record.s3.bucket.name);
continue;
}
if (!checkFileDate(record.s3.object.key))
{
console.error('Unexpected file, or date is not today\'s: ' + record.s3.object.key);
continue;
}
const params = {
Bucket: record.s3.bucket.name,
Key: record.s3.object.key
};
var csvFile = await s3.getObject(params).promise();
var allText = csvFile.Body.toString('utf-8');
console.log('Loaded data:', {Bucket: params.Bucket, Filename: params.Key, Text: allText});
ret.push(allText);
} catch (error) {
console.log("Couldn't download CSV from S3", error);
return { statusCode: 500, body: error };
}
}
// I've been randomly trying different ways to return the data; none works. The data itself is correct, I checked with console.log()
const response = {
statusCode: 200,
body: { "Records": ret }
};
return ret;
};
(The question included a screenshot here showing how the Lambda was set up, in particular its destination configuration.)
I haven't posted on Stackoverflow in 7 years. That's how desperate I am. Thanks for the help.
Rather than getting each Lambda to call the next one, take a look at AWS's managed service for state machines, Step Functions, which can handle this workflow for you.
By defining inputs and outputs, you can pass each function's output to the next one, with retry logic built in.
If you don't have much experience, AWS has a tutorial on setting up a Step Function that chains Lambdas.
Using this also means you will not need to account for configuration issues such as Lambda timeouts. In addition, it keeps your code modular, which makes it easier to test individual functions while also isolating issues.
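For illustration, a minimal sketch of such a state machine created with the AWS CLI; the state machine name, IAM role ARN and Lambda function ARNs are placeholders, not values from the question:
aws stepfunctions create-state-machine \
  --name csv-to-rds-chain \
  --role-arn arn:aws:iam::123456789012:role/stepfunctions-execution-role \
  --definition '{
    "StartAt": "DownloadCsv",
    "States": {
      "DownloadCsv": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:DownloadCsv",
        "Next": "ParseCsv"
      },
      "ParseCsv": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ParseCsv",
        "Next": "PutDataInRds"
      },
      "PutDataInRds": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:PutDataInRds",
        "End": true
      }
    }
  }'
By default each Task state's output becomes the next state's input, which replaces the manual destination wiring between the Lambdas.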
The execution roles of all Lambda functions whose destinations include other Lambda functions must have the lambda:InvokeFunction IAM permission in one of their attached IAM policies.
Here's a snippet from Lambda documentation:
To send events to a destination, your function needs additional permissions. Add a policy with the required permissions to your function's execution role. Each destination service requires a different permission, as follows:
Amazon SQS – sqs:SendMessage
Amazon SNS – sns:Publish
Lambda – lambda:InvokeFunction
EventBridge – events:PutEvents
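As a hedged example, the permission could be added to the first function's execution role like this; the role name, policy name, region, account id and target function ARN are placeholders:
aws iam put-role-policy \
  --role-name DownloadCsv-execution-role \
  --policy-name allow-invoke-next-lambda \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": "lambda:InvokeFunction",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:AAA_Thomas_DailyOverframeS3CsvToAnalytics_ParseCsv"
    }]
  }'
Also note that destinations are only evaluated for asynchronous invocations; S3 event notifications do invoke the function asynchronously, so that part is already satisfied here.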

How do I run multiple SQL statements in an AppSync resolver?

I have two tables, cuts and holds. In response to an event in my application I want to be able to move the values from a hold entry with a given id across to the cuts table. The naïve way is to do an INSERT and then a DELETE.
How do I run multiple SQL statements in an AppSync resolver to achieve that result? I have tried the following (replacing sql with statements and turning it into an array) without success.
{
"version" : "2017-02-28",
"operation": "Invoke",
#set($id = $util.autoId())
"payload": {
"statements": [
"INSERT INTO cuts (id, rollId, length, reason, notes, orderId) SELECT '$id', rollId, length, reason, notes, orderId FROM holds WHERE id=:ID",
"DELETE FROM holds WHERE id=:ID"
],
"variableMapping": {
":ID": "$context.arguments.id"
},
"responseSQL": "SELECT * FROM cuts WHERE id = '$id'"
}
}
If you're using the "AWS AppSync Using Amazon Aurora as a Data Source via AWS Lambda" sample found here (https://github.com/aws-samples/aws-appsync-rds-aurora-sample), you won't be able to send multiple statements in the sql field.
If you are using the AWS AppSync integration with the Aurora Serverless Data API, you can pass up to 2 statements in a statements array such as in the example below:
{
"version": "2018-05-29",
"statements": [
"select * from Pets WHERE id='$ctx.args.input.id'",
"delete from Pets WHERE id='$ctx.args.input.id'"
]
}
If you're using the "AWS AppSync Using Amazon Aurora as a Data Source via AWS Lambda" sample (https://github.com/aws-samples/aws-appsync-rds-aurora-sample), you can do it as follows.
In the resolver, add "sql0" and "sql1" fields (you can name them whatever you want):
{
"version" : "2017-02-28",
"operation": "Invoke",
#set($id = $util.autoId())
"payload": {
"sql":"INSERT INTO cuts (id, rollId, length, reason, notes, orderId)",
"sql0":"SELECT '$id', rollId, length, reason, notes, orderId FROM holds WHERE id=:ID",
"sql1":"DELETE FROM holds WHERE id=:ID",
"variableMapping": {
":ID": "$context.arguments.id"
},
"responseSQL": "SELECT * FROM cuts WHERE id = '$id'"
}
}
In your lambda, add the following piece of code:
if (event.sql0) {
const inputSQL0 = populateAndSanitizeSQL(event.sql0, event.variableMapping, connection);
await executeSQL(connection, inputSQL0);
}
if (event.sql1) {
const inputSQL1 = populateAndSanitizeSQL(event.sql1, event.variableMapping, connection);
await executeSQL(connection, inputSQL1);
}
With this approach, you can send as many SQL statements to your Lambda as you want, and the Lambda will execute them.

Trying to automate AMI backup of EC2 instance

I have tried automating the backup of an AWS EC2 instance using a Lambda function triggered by a CloudWatch event. I am using the free tier.
I have scheduled the backup every 5 minutes but, after the first backup (i.e. the first AMI creation), no further AMIs are created.
Can we create multiple AMIs of the same instance?
Below is the lambda function used.
Regards
Monika
var aws = require('aws-sdk');
aws.config.region = 'us-east-1';
var ec2 = new aws.EC2();
// note: this top-level code runs once per container initialisation, not on every invocation,
// so a warm container reuses the same date/time values (and therefore the same AMI name)
var now = new Date();
var date = now.toISOString().substring(0, 10);
var hours = now.getHours();
var minutes = now.getMinutes();
exports.handler = function(event, context) {
var instanceparams = {
Filters: [{
Name: 'tag:Backup',
Values: [
'yes'
]
}]
};
ec2.describeInstances(instanceparams, function(err, data) {
if (err) console.log(err, err.stack);
else {
for (var i in data.Reservations) {
for (var j in data.Reservations[i].Instances) {
var instanceid = data.Reservations[i].Instances[j].InstanceId;
var nametag = data.Reservations[i].Instances[j].Tags;
for (var k in data.Reservations[i].Instances[j].Tags) {
if (data.Reservations[i].Instances[j].Tags[k].Key == 'Name') {
var name = data.Reservations[i].Instances[j].Tags[k].Value;
}
}
console.log("Creating AMIs of the Instance: ", name);
var imageparams = {
InstanceId: instanceid,
Name: name + "_" + date + "_" + hours + "-" + minutes,
NoReboot: true
};
ec2.createImage(imageparams, function(err, data) {
if (err) console.log(err, err.stack);
else {
var image = data.ImageId;
console.log(image);
var tagparams = {
Resources: [image],
Tags: [{
Key: 'DeleteOn',
Value: 'yes'
}]
};
ec2.createTags(tagparams, function(err, data) {
if (err) console.log(err, err.stack);
else console.log("Tags added to the created AMIs");
});
}
});
}
}
}
});
};
It is not being created because AMI names must be unique: it is impossible to create another AMI with the same name.
An AMI is the same as a Snapshot, except it can also be used to launch a new instance. An AMI can also contain multiple snapshots (multiple drives).
If your system operates from one volume (the boot volume), having an AMI is an easy way to launch a new instance with exactly the same data. This is normally done to launch an instance with pre-installed software (thus making it in a known state), but can also be used for backup purposes.
Having a snapshot as a backup certainly provides a copy of the volume as at the time of snapshot creation, but to restore the snapshot you actually have to restore the snapshot to a new EBS volume, convert the snapshot to an AMI, then launch an instance from it. (It's a bit harder if it is a Windows boot volume.)
Snapshots and AMIs are incremental, only needing to copy blocks that have been added or changed since a previous snapshot/AMI was created. Thus, they can be very fast to create.
It is not immediately obvious why your code is not functioning correctly. I would suggest adding debug statements before each API call and within the callbacks to obtain more information.
For reference, see also an EBS Snapshotter in Python.
You can automate your AMI backups. I'm not a Lambda expert, but it can be done -- make sure the IAM roles have the correct permissions and that your functions look for the EC2 Backup and Retention tags. Then you can schedule it through the management console. Here's an article with solid details on creating this function. There are other ways to automate snapshots/backups in AWS, if interested.
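If you would rather do the scheduling part from the CLI instead of the console, a rough sketch; the rule name, schedule expression, region, account id and function name are placeholders:
# create a scheduled CloudWatch Events / EventBridge rule
aws events put-rule \
  --name daily-ami-backup \
  --schedule-expression "rate(1 day)"

# point the rule at the backup Lambda
aws events put-targets \
  --rule daily-ami-backup \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:ami-backup"

# allow the rule to invoke the function
aws lambda add-permission \
  --function-name ami-backup \
  --statement-id daily-ami-backup-rule \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:us-east-1:123456789012:rule/daily-ami-backup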

AWS API gateway to fetch records from multiple tables in a single API call

I have two tables in DynamoDB, both of which have the same key, "sessionid". Using AWS API Gateway, I want to fetch and display records from both tables when I pass a value for "sessionid".
Currently, my mapping template looks like the one below, which retrieves records from Table1 only:
Integration Request-
{
"TableName": "Table1",
"KeyConditionExpression": "sessionid = :v1",
"ExpressionAttributeValues": {
":v1": {
"S": "$input.params('sessionid')"
}
}
}
Integration Response-
#set($inputRoot = $input.path('$'))
{
"Table1": [
#foreach($elem in $inputRoot.Items) {
"sessionid": "$elem.sessionid.S",
"rowId": "$elem.rowId.S"
}#if($foreach.hasNext),#end
#end
]
}
How can I integrate the mapping for the second table "Table2" in the above mapping, so records from both tables are retrieved in a single API call?
Your suggestions will help.
API Gateway can only make a single downstream API call, and unfortunately the DynamoDB API only supports a single table per Scan/Query/GetItem request. If you want to aggregate items from multiple tables, you would have to use a Lambda function between API Gateway and DynamoDB.
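To illustrate what that Lambda would do, here are the two separate queries it would have to run, written with the AWS CLI; the table names and key come from the question, and the session id value is a placeholder:
aws dynamodb query \
  --table-name Table1 \
  --key-condition-expression "sessionid = :v1" \
  --expression-attribute-values '{":v1": {"S": "example-session-id"}}'

aws dynamodb query \
  --table-name Table2 \
  --key-condition-expression "sessionid = :v1" \
  --expression-attribute-values '{":v1": {"S": "example-session-id"}}'
The Lambda would run both queries (e.g. via the AWS SDK) and merge the two result sets into a single response for API Gateway.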

Copying one table to another in DynamoDB

What's the best way to identically copy one table over to a new one in DynamoDB?
(I'm not worried about atomicity).
Create a backup (via the Backups option) and restore it with a new table name. That gets all the data into the new table.
Note: this takes a considerable amount of time, depending on the table size.
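For reference, a minimal sketch of the same flow with the AWS CLI; the table names are placeholders, and the restore call needs the BackupArn returned by the first command:
# create an on-demand backup of the source table
aws dynamodb create-backup \
  --table-name SourceTable \
  --backup-name SourceTable-copy

# restore that backup into a brand-new table (paste the BackupArn from the previous output)
aws dynamodb restore-table-from-backup \
  --target-table-name DestinationTable \
  --backup-arn arn:aws:dynamodb:us-east-1:123456789012:table/SourceTable/backup/0123456789012-abcdefgh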
I just used the python script, dynamodb-copy-table, making sure my credentials were in some environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY), and it worked flawlessly. It even created the destination table for me.
python dynamodb-copy-table.py src_table dst_table
The default region is us-west-2, change it with the AWS_DEFAULT_REGION env variable.
AWS Data Pipeline provides a template which can be used for this purpose: "CrossRegion DynamoDB Copy".
See: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-crossregion-ddb-create.html
The result is a simple pipeline.
Although it's called CrossRegion, you can easily use it within the same region as long as the destination table name is different (remember that table names are unique per account and region).
You can use Scan to read the data and save it to the new table.
On the AWS forums a guy from the AWS team posted another approach using EMR: How Do I Duplicate a Table?
Here's one solution to copy all items from one table to another using just shell scripting, the AWS CLI and jq. It will work OK for smallish tables.
# exit on error
set -eo pipefail
# tables
TABLE_FROM=<table>
TABLE_TO=<table>
# read
aws dynamodb scan \
--table-name "$TABLE_FROM" \
--output json \
| jq "{ \"$TABLE_TO\": [ .Items[] | { PutRequest: { Item: . } } ] }" \
> "$TABLE_TO-payload.json"
# write (batch-write-item accepts at most 25 put requests per call, hence "smallish tables")
aws dynamodb batch-write-item --request-items file://"$TABLE_TO-payload.json"
# clean up
rm "$TABLE_TO-payload.json"
If you want both tables to be identical, you'd want to delete all items in TABLE_TO first.
DynamoDB now supports importing from S3.
https://aws.amazon.com/blogs/database/amazon-dynamodb-can-now-import-amazon-s3-data-into-a-new-table/
So, probably in almost all use cases, the easiest and cheapest way to replicate a table is:
Use the "Export to S3" feature to dump the entire table into S3. Since this uses backups to generate the dump, the table's throughput is not affected, and it is very fast as well. You need to have point-in-time recovery (PITR) enabled. See https://aws.amazon.com/blogs/aws/new-export-amazon-dynamodb-table-data-to-data-lake-amazon-s3/
Use "Import from S3" to import the dump created in step 1. The import always creates a new table; you cannot import into an existing one. (A rough CLI sketch of both steps follows.)
Use this Node.js module: copy-dynamodb-table
This is a little script I made to copy the contents of one table to another.
It's based on the AWS-SDK v3. Not sure how well it would scale to big tables but as a quick and dirty solution it does the job.
It gets your AWS credentials from a profile in ~/.aws/credentials; change "default" to the name of the profile you want to use.
Other than that, it takes two arguments: one for the source table and one for the destination.
const { fromIni } = require("@aws-sdk/credential-providers");
const { DynamoDBClient, ScanCommand, PutItemCommand } = require("@aws-sdk/client-dynamodb");
const ddbClient = new DynamoDBClient({
credentials: fromIni({profile: "default"}),
region: "eu-west-1",
});
const args = process.argv.slice(2);
console.log(args)
async function main() {
const { Items } = await ddbClient.send(
new ScanCommand({
TableName: args[0],
})
);
console.log("Successfully scanned table")
console.log("Copying", Items.length, "Items")
const putPromises = [];
Items.forEach((item) => {
putPromises.push(
ddbClient.send(
new PutItemCommand({
TableName: args[1],
Item: item,
})
)
);
});
await Promise.all(putPromises);
console.log("Successfully copied table")
}
main();
Usage
node copy-table.js <source_table_name> <destination_table_name>
Python + boto3 🚀
The script is idempotent as long as you maintain the same keys.
import boto3

def migrate(source, target):
    dynamo_client = boto3.client('dynamodb', region_name='us-east-1')
    dynamo_target_client = boto3.client('dynamodb', region_name='us-west-2')
    dynamo_paginator = dynamo_client.get_paginator('scan')
    dynamo_response = dynamo_paginator.paginate(
        TableName=source,
        Select='ALL_ATTRIBUTES',
        ReturnConsumedCapacity='NONE',
        ConsistentRead=True
    )
    for page in dynamo_response:
        for item in page['Items']:
            dynamo_target_client.put_item(
                TableName=target,
                Item=item
            )

if __name__ == '__main__':
    migrate('awesome-v1', 'awesome-v2')
On November 29th, 2017 Global Tables was introduced. This may be useful depending on your use case, which may not be the same as the original question. Here are a few snippets from the blog post:
Global Tables – You can now create tables that are automatically replicated across two or more AWS Regions, with full support for multi-master writes, with a couple of clicks. This gives you the ability to build fast, massively scaled applications for a global user base without having to manage the replication process.
...
You do not need to make any changes to your existing code. You simply send write requests and eventually consistent read requests to a DynamoDB endpoint in any of the designated Regions (writes that are associated with strongly consistent reads should share a common endpoint). Behind the scenes, DynamoDB implements multi-master writes and ensures that the last write to a particular item prevails. When you use Global Tables, each item will include a timestamp attribute representing the time of the most recent write. Updates are propagated to other Regions asynchronously via DynamoDB Streams and are typically complete within one second (you can track this using the new ReplicationLatency and PendingReplicationCount metrics).
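For reference, the original (2017.11.29) version of global tables could be wired up from the CLI roughly like this; it assumes the table already exists in both regions with the same name and key schema, has streams enabled, and is empty:
aws dynamodb create-global-table \
  --global-table-name MyTable \
  --replication-group RegionName=us-east-1 RegionName=eu-west-1 \
  --region us-east-1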
Another option is to download the table as a .csv file and upload it with the following snippet of code.
This also eliminates the need to provide your AWS credentials to a package such as the one @ezzat suggests.
Create a new folder and add the following two files and your exported table
Edit uploadToDynamoDB.js and add the filename of the exported table and your table name
Run npm install in the folder
Run node uploadToDynamoDB.js
File: package.json
{
"name": "uploadtodynamodb",
"version": "1.0.0",
"description": "",
"main": "uploadToDynamoDB.js",
"author": "",
"license": "ISC",
"dependencies": {
"async": "^3.1.1",
"aws-sdk": "^2.624.0",
"csv-parse": "^4.8.5",
"fs": "0.0.1-security",
"lodash": "^4.17.15",
"uuid": "^3.4.0"
}
}
File: uploadToDynamoDB.js
var fs = require('fs');
var parse = require('csv-parse');
var async = require('async');
var _ = require('lodash')
var AWS = require('aws-sdk');
// If your table is in another region, make sure to update this
AWS.config.update({ region: "eu-central-1" });
var ddb = new AWS.DynamoDB({ apiVersion: '2012-08-10' });
var csv_filename = "./TABLE_CSV_EXPORT_FILENAME.csv";
var tableName = "TABLENAME"
function prepareData(data_chunk) {
const items = data_chunk.map(obj => {
const keys = Object.keys(obj)
let attr = Object.values(obj)
attr = attr.map(a => {
let newAttr;
// Can we make this an integer
if (isNaN(Number(a))) {
newAttr = { "S": a }
} else {
newAttr = { "N": a }
}
return newAttr
})
let item = _.zipObject(keys, attr)
return {
PutRequest: {
Item: item
}
}
})
var params = {
RequestItems: {
[tableName]: items
}
};
return params
}
rs = fs.createReadStream(csv_filename);
parser = parse({
columns : true,
delimiter : ','
}, function(err, data) {
var split_arrays = [], size = 25;
while (data.length > 0) {
split_arrays.push(data.splice(0, size));
}
data_imported = false;
chunk_no = 1;
async.each(split_arrays, function(item_data, callback) {
const params = prepareData(item_data)
ddb.batchWriteItem(
params,
function (err, data) {
if (err) {
console.log("Error", err);
} else {
console.log("Success", data);
}
// tell async.each this chunk is done so the final callback can run
callback();
});
}, function() {
// run after loops
console.log('all data imported....');
});
});
rs.pipe(parser);
It's been a very long time since the question was posted, and AWS has been continuously improving its features. At the time of writing this answer, we have the option to export the table to an S3 bucket and then use the import feature to load that data from S3 into a new table, which is created automatically with the data. Please refer to this blog for more details on export and import.
The best part is that you get to change the table name, PK or SK.
Note: you have to enable PITR (which might incur additional costs). It's always best to refer to the documentation.
Here is another simple Python utility script for this: ddb_table_copy.py. I use it often.
usage: ddb_table_copy.py [-h] [--dest-table DEST_TABLE] [--dest-file DEST_FILE] source_table
Copy all DynamoDB items from SOURCE_TABLE to either DEST_TABLE or DEST_FILE. Useful for migrating data during a stack teardown/re-creation.
positional arguments:
source_table Name of source table in DynamoDB.
optional arguments:
-h, --help show this help message and exit
--dest-table DEST_TABLE
Name of destination table in DynamoDB.
--dest-file DEST_FILE
A valid file path to save the items to, e.g. 'items.json'.