Automating DynamoDB scripts

Like we used to do with RDBMS SQL scripts, I want to do a similar thing with my DynamoDB tables.
Currently it's very difficult to track changes from environment to environment (dev - qa - prod). We are making changes directly via the console.
What I want to do is keep the table data/JSON in Git version control, and whenever a dev makes a change, we should be able to run a script that migrates the respective changes to the DynamoDB table, e.g. create/update/delete tables and add/remove/update records.
But I have not been able to find a proper way/guide to achieve this. We are using JavaScript/Node.js as our base language.
Any help regarding this scenario will be appreciated.
Thanks
ref : https://forums.aws.amazon.com/thread.jspa?threadID=342538

As far as I can tell, you described two separate issues:
Changing the table's "structure"
Updating records after the update
Before I go into my answer, remember that DynamoDB is a NoSQL database, while your previous RDBMS was a relational database. Operational tasks can differ greatly between the two types of databases.
1. Automating changes to the "structure" of the table
For this you can check out Infrastructure as Code tools like Terraform, CloudFormation or Pulumi.
But since DynamoDB is a NoSQL database, those tools only let you do a few things, like setting your hash and sort keys and defining indexes. Adding "fields" to DynamoDB is not done with those tools, because except for the hash and sort key there are no fields; everything else does not follow an explicit (SQL) schema.
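As an illustration, here is a minimal Pulumi (JavaScript) sketch of defining a table as code; the table and key names are placeholders, and only the keys and billing mode are part of the managed "structure":

```javascript
// Minimal Pulumi (JavaScript) sketch of a DynamoDB table as code.
// Table name and key names are placeholders.
const aws = require('@pulumi/aws');

const table = new aws.dynamodb.Table('app-table', {
  billingMode: 'PAY_PER_REQUEST',
  hashKey: 'pk',
  rangeKey: 'sk',
  attributes: [
    { name: 'pk', type: 'S' },
    { name: 'sk', type: 'S' },
  ],
});

// Export the generated table name as a stack output.
exports.tableName = table.name;
```

Terraform's aws_dynamodb_table resource and CloudFormation's AWS::DynamoDB::Table take essentially the same inputs.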
2. Updating records after an update
If you do not have a lot of records, you could write yourself a simple tool or script that does the relevant work using the AWS SDK, and run it during your CI/CD pipeline. A simple approach would be to have a "migrations" folder and, if there is a file in it, have the pipeline execute it; after the migration is done, just remove the file again. Not great, but pragmatic.
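A rough sketch of such a runner, assuming a local migrations/ folder whose files each export an async up() function (the folder layout, region, and use of the AWS SDK v2 DocumentClient are assumptions for illustration):

```javascript
// Hypothetical migration runner: executes every file in ./migrations in order.
const fs = require('fs');
const path = require('path');
const AWS = require('aws-sdk');

const docClient = new AWS.DynamoDB.DocumentClient({ region: 'eu-west-1' });
const migrationsDir = path.join(__dirname, 'migrations');

async function runMigrations() {
  const files = fs
    .readdirSync(migrationsDir)
    .filter((f) => f.endsWith('.js'))
    .sort();

  for (const file of files) {
    const migration = require(path.join(migrationsDir, file));
    console.log(`Running migration ${file}`);
    await migration.up(docClient); // each migration does its own puts/updates
  }
}

runMigrations().catch((err) => {
  console.error('Migration failed', err);
  process.exit(1);
});
```

Each migration file would then use the client it receives to put items or rewrite records as needed.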
If you have a lot of records, this won't work that well anymore, at least if you want a downtime-less deployment. In that case you will have to update your software to work with both the old and new versions of the record structure, while you gradually update all records in the background (using a script etc.). Once all the records are updated, you can remove the code paths that handle the old structure.
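For example, a background backfill might page through the table and rewrite old-format items; the table name, key, and renamed attribute below are hypothetical:

```javascript
// Sketch of a background backfill that gradually rewrites old-format records.
// Assumes a hypothetical table "Orders" with partition key "pk" and an old
// attribute "customer_name" being renamed to "customerName".
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient({ region: 'eu-west-1' });

async function backfill() {
  let lastKey;
  do {
    const page = await docClient
      .scan({ TableName: 'Orders', ExclusiveStartKey: lastKey, Limit: 100 })
      .promise();

    for (const item of page.Items) {
      if (item.customer_name === undefined) continue; // already migrated
      await docClient
        .update({
          TableName: 'Orders',
          Key: { pk: item.pk },
          UpdateExpression: 'SET customerName = :v REMOVE customer_name',
          ExpressionAttributeValues: { ':v': item.customer_name },
        })
        .promise();
    }
    lastKey = page.LastEvaluatedKey;
  } while (lastKey);
}

backfill().catch(console.error);
```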

Related

AWS Glue: Add an Attribute to CSV to Distinguish Between Data Sets

I need to pull two companies' data from their respective AWS S3 buckets, map their columns in Glue, and export them to a specific schema in a Microsoft SQL database. The schema is to have one table, with the companies' data being distinguished with attributes for each of their sites (each company has multiple sites).
I am completely new to AWS and SQL. Would someone mind explaining to me how to add an attribute to the data, or pointing me to some good literature on this? I feel like manipulating the .csv in the Python script I'm already running (which automatically downloads the data from another site and then uploads it to S3) could be an option, deleting NaN columns and adding a column for site name, but I'm not entirely sure.
I apologize if this has already been answered elsewhere. Thanks!
I find this website to generally be pretty helpful with figuring out SQL stuff. I've linked to the ALTER TABLE commands that would allow you to do this through SQL.
If you are already running a Python script to edit the .csv, then I would personally edit the data there. Depending on the size of the data sets, you can run your script as a Lambda or Batch job to grab, edit, and then upload to S3. Then you can run your Glue crawler or whatever process you're using to map the columns.
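The asker's existing script is Python, but the transformation itself is small; as a sketch of the idea (shown in Node.js for consistency with the rest of this page, with bucket, key, and file path as placeholders):

```javascript
// Add a "site" column to a CSV file and upload it to S3.
const fs = require('fs');
const AWS = require('aws-sdk');

const s3 = new AWS.S3({ region: 'us-east-1' });

function addSiteColumn(csvText, siteName) {
  const [header, ...rows] = csvText.trim().split('\n');
  return [`${header},site`, ...rows.map((row) => `${row},${siteName}`)].join('\n');
}

async function uploadWithSite(filePath, siteName) {
  const csv = fs.readFileSync(filePath, 'utf8');
  await s3
    .putObject({
      Bucket: 'my-company-raw-data',   // placeholder bucket
      Key: `uploads/${siteName}.csv`,  // placeholder key
      Body: addSiteColumn(csv, siteName),
      ContentType: 'text/csv',
    })
    .promise();
}

uploadWithSite('./export.csv', 'site-a').catch(console.error);
```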

AWS Neptune Change Management

We are considering using AWS Neptune as a graph DB solution.
I am coming from the Django world, so I am used to using DB migrations a lot.
I could not find any info about how AWS Neptune handles change management on the DB.
i.e. what happens if I want to reload a backup from a month ago and there have been schema changes since then? How do we track these changes?
Should we write custom scripts?
Unlike an RDBMS and some other data stores, Amazon Neptune, like many other graph DBs for that matter, is "schemaless", meaning there is no need to explicitly define or maintain a schema. The schema is implicitly defined by the data stored in the database. In the case you mentioned, restoring a backup, there is no need for a migration/change script to be run; when you restore the backup, the schema will be defined by the restored data.
This "schemaless" nature of the database allows applications to begin adding new entity types and data properties without any sort of ETL process. However, it also means that the application needs to manage some sort of schema internally to maintain sanity over the data being stored (e.g. first_name and firstName could both be used and would be separate properties).

Single-table DB architecture with AWS Amplify

By default, AWS Amplify transformers create a table for each GraphQL type.
But according to the DynamoDB documentation, it's best practice to:
Keep as few tables as possible
Keep entries that are often queried together within the same table
I have the impression that the Amplify way of doing things contradicts the statement above.
I am new to both NoSQL and Amplify.
Can someone suggest ways to address those issues?
I think we're in a bit of a transition or gray area here. I'm very new to Amplify and have been investigating moving to a single-table design. There are sources (below) indicating that it has always been possible, but you'd have to write everything in VTL templates. In 2020, however, they released direct Lambda resolver support: https://youtu.be/EOQqi6Yun7g?t=960 (clip)
However, it seems like you lose access to the @auth directive (and probably others, because you're no longer going to use @model), along with a lot of the nice out-of-the-box functionality that's available with Amplify's multi-table approach.
At this point, being that I'm developing a new app, I'm going to stick with the default multi-table design to hasten the process of getting the app functional.
Trying to implement the single-table design seems to go against what the Amplify team recommends and requires more manual work. You'd have to manually create custom Lambda functions (AppSync resolvers), code the DynamoDB queries for each data-access element, and manage authorization through some other means which I'm not aware of at this time. Maybe someone can chime in here...
Single table vs multi table info
Using Amplify with single table:
https://youtu.be/EOQqi6Yun7g
Single vs Multi Clip:
https://youtu.be/1WF_wped808?t=1251 (clip)
https://www.alexdebrie.com/posts/dynamodb-single-table/ (towards bottom)
https://youtu.be/EOQqi6Yun7g?t=1288 (clip)
Example single table design by Alex Debrie:
https://gist.github.com/dabit3/96dc51e688b18a7d40fc534331758c56
More Discussion:
https://stackoverflow.com/a/56438716/1956540
Basic Setup steps
I set up a single table by following the instructions below. Again, you don't use @model for this. Also, I think you have to include a type Query {} in your schema for it to compile, but I could be wrong here.
So the basic steps are:
Create a single table (amplify add storage)
amplify push
Create your schema in the schema.graphql file.
Create a supporting Lambda function (amplify add function)
Note: if you look at the example here, I believe you can create an entry point that routes to all the other methods: https://gist.github.com/dabit3/96dc51e688b18a7d40fc534331758c56#lambda
Add the DynamoDB query code in the function (a rough sketch follows this list).
amplify push
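As a rough sketch of that entry-point idea, assuming a hypothetical table with pk/sk keys and two example fields (the exact event payload depends on how the resolver is wired up):

```javascript
// Single Lambda entry point routing AppSync fields to one DynamoDB table.
// Table name, key names, and field names are placeholders.
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();
const TABLE_NAME = process.env.TABLE_NAME || 'SingleTable';

async function getCustomer(args) {
  const res = await docClient
    .get({ TableName: TABLE_NAME, Key: { pk: `CUSTOMER#${args.id}`, sk: 'PROFILE' } })
    .promise();
  return res.Item;
}

async function listOrdersByCustomer(args) {
  const res = await docClient
    .query({
      TableName: TABLE_NAME,
      KeyConditionExpression: 'pk = :pk AND begins_with(sk, :sk)',
      ExpressionAttributeValues: { ':pk': `CUSTOMER#${args.customerId}`, ':sk': 'ORDER#' },
    })
    .promise();
  return res.Items;
}

exports.handler = async (event) => {
  // Direct Lambda resolvers receive the AppSync context; other wirings may
  // pass a flatter payload, hence the fallback.
  const fieldName = (event.info && event.info.fieldName) || event.fieldName;
  const args = event.arguments || {};

  switch (fieldName) {
    case 'getCustomer':
      return getCustomer(args);
    case 'listOrdersByCustomer':
      return listOrdersByCustomer(args);
    default:
      throw new Error(`Unhandled field: ${fieldName}`);
  }
};
```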
Complete steps for Setting up a single Table:
https://catalog.us-east-1.prod.workshops.aws/workshops/53b10bf8-2271-4ab4-bfd2-39e878a90dc8/en-US/lab2/1-vtl (both "Connecting to an existing DynamoDB table" and "Direct Lambda Resolver" steps)
I'm not trying to be negative about Amplify; it is awesome, and I love what they are doing with this product. I just think it's very new to everyone, and I'm hoping this post is no longer valid next year and we continue to see great progress from the team.

Do I need to update the item CSV in AWS Personalize?

I'm trying to use AWS Personalize and following their documentation.
So I've uploaded the dataset files (interactions, users, items) to S3, then created a solution and a campaign.
And I implemented the PutEvents API using Java.
The GetRecommendations API call works well.
At this point I'm wondering whether I need to update the dataset files, especially the item CSV.
In general, you're done at this point for very basic recommendations.
Since you are using the PutEvents call, all of your real-time events are added to the Interactions dataset that way. Interactions data created by manual import and by PutEvents calls are kept separate from each other; you can actually see them in the Personalize Datasets web console.
You might still want to update the dataset files using the dataset import job feature, but it's going to replace your existing dataset. In general I would recommend using it only when:
You just created a fresh/bigger/better dump of your database with Interactions.
You've found that your previous interactions dataset was invalid.
The schema of the dataset changed (pretty much you are forced to do it then).
The User or Item dataset changed/improved. It's actually a good idea to refresh these often, so Personalize can produce better recommendations. Keep in mind that this also requires retraining the Solution, so that the new Items/Users are included during recommendation generation.
So for interactions, you usually don't want to update the dataset. For the other datasets, it might even be a good idea to create an automatic import mechanism.
Keep in mind that the Items and Users datasets are only used by Personalize recipes that support metadata; otherwise they are simply ignored.
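If you do automate that refresh, a hedged sketch with the AWS SDK for JavaScript might look like this (the dataset ARN, role, and S3 path are placeholders):

```javascript
// Kick off a dataset import job to refresh the Items dataset.
const AWS = require('aws-sdk');
const personalize = new AWS.Personalize({ region: 'us-east-1' });

async function refreshItemsDataset() {
  const res = await personalize
    .createDatasetImportJob({
      jobName: `items-refresh-${Date.now()}`,
      datasetArn: 'arn:aws:personalize:us-east-1:123456789012:dataset/my-group/ITEMS',
      dataSource: { dataLocation: 's3://my-personalize-bucket/items.csv' },
      roleArn: 'arn:aws:iam::123456789012:role/PersonalizeS3AccessRole',
    })
    .promise();
  console.log('Import job ARN:', res.datasetImportJobArn);
}

refreshItemsDataset().catch(console.error);
```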

Strategy for Updating Schema/Data of Data Stored in AWS S3

At my organization, we are using a stack of AWS S3, AWS Glue, and Athena to drive some reporting of internal metrics. In general, this stack is great for quickly setting up reporting off of raw data (stored in S3). The problem we've come up against is what to do if we notice we need to update data that's already stored in S3. For example, we want to update values in a column that contain a certain string.
Unlike a database, we can't just run a query to update all the existing data. I've tried to see if we can utilize Glue Jobs to accomplish this, but from my limited understanding, it doesn't seem like it's meant to do ETL from a bucket back to the same bucket.
The only thing I can think of is to write a custom tool that iterates through an S3 bucket, loads a file, applies the transformation, and puts it back, overwriting the original. It seems like there has to be a better way, though.
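For reference, that custom-tool idea is only a few dozen lines with the AWS SDK; the bucket, prefix, and transformation below are placeholders:

```javascript
// List the objects under a prefix, apply a transformation, and write each
// object back in place.
const AWS = require('aws-sdk');
const s3 = new AWS.S3({ region: 'us-east-1' });

const BUCKET = 'my-reporting-bucket';
const PREFIX = 'raw/metrics/';

function transform(body) {
  // Example transformation: rewrite a specific string value.
  return body.replace(/old-value/g, 'new-value');
}

async function rewriteObjects() {
  let token;
  do {
    const page = await s3
      .listObjectsV2({ Bucket: BUCKET, Prefix: PREFIX, ContinuationToken: token })
      .promise();

    for (const obj of page.Contents || []) {
      const current = await s3.getObject({ Bucket: BUCKET, Key: obj.Key }).promise();
      const updated = transform(current.Body.toString('utf8'));
      await s3.putObject({ Bucket: BUCKET, Key: obj.Key, Body: updated }).promise();
    }
    token = page.NextContinuationToken;
  } while (token);
}

rewriteObjects().catch(console.error);
```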
Updates are not handled natively in a traditional Hive-like warehousing solution, which is what I consider Athena to be. A common solution is a kind of engineering workaround where you "insert overwrite" a partition (borrowing Hive syntax; this is possible in Presto and hopefully also in Athena, which is based on Presto).
Other solutions include creating new tables and atomically replacing a view, which users are supposed to query, instead of querying the underlying table(s) directly.
As this is a common problem, there are also some ready-to-use solutions for it, but I do not know whether they are possible with Athena. They are certainly possible with Presto (Presto SQL):
Hive ACID transactional tables (updates currently require the Hive runtime)
Delta Lake (open sourced by Databricks; updates currently require the Spark runtime)
Apache Hudi (I know little about this one)