Issues while working with Amazon Aurora Database - amazon-web-services

My Requirments:
I want to store real-time events data coming from e-commerce websites into a database
In parallel to storing the data, i want to access the events data from a database
I want to perform some sort of ad-hoc analysis(SQL)
Using some sort of built-in methods(either from Boto3 or JAVA SDK), I want to access the events data
I want to create some sort of Custom-API's to access events data stored in database
I recently came across with Amazon Aurora(mysql) database.
I thought Aurora is one of the good example for my requirements. But when I dig into this Amazon Aurora(mysql), I noticed that we can create a database using AWS-CDK
BUT
1. No equivalent methods to create tables using AWS-CDK/BOTO3
2. No equivalent methods in BOTO3 or JAVA SDK to store/access the database data
Can anyone tell me how i can create a table using(IAC) in AURORA db?
Can anyone tell me how i can store realtime data into AURORA?
Can anyone tell me how i can access realtime data stored in AURORA?

No equivalent methods to create tables using AWS-CDK/BOTO3
This is because only Aurora Serveless can be accessed using Data API, not regular database.
You have to use regular mysql tools (e.g., mysql cli, phpmyadmin, mysql workbench etc) to create tables and populate them.
No equivalent methods in BOTO3 or JAVA SDK to store/access the database data
Same reason and solution as for point 1.
Can anyone tell me how i can create a table using(IAC) in AURORA db?
Terraform has mysql, but its not for tables, but users and databases.
Can anyone tell me how i can store realtime data into AURORA?
There is no out-of-the box solution for that, so you need custom solution for that. Maybe stream data to Kinesis Streams or Firehose, then to lambda and lambda will populate your DB? Seems easiest to implement.
Can anyone tell me how i can access realtime data stored in AURORA?
If you stream data to Kinesis Stream first, you can use Kinesis Analytics to analyze it in real time.
Since many of the above requires custom solutions, other architectures are possible.

Create connectoin manager as
DriverManager.getConnection(
"jdbc:mysql://localhost:3306/$dbName", //replace here with you endpoints & database name
"root",
"admin123"
) then
val stmt: Statement = con.createStatement()
stmt.executeQuery("use productcatalogueinfo;")
Whenever your lambda is triggering then it performs this connection and DDL operations too.

Related

Migrate and Transform MySQL rows from one table to another

I need to migrate all records 3 billions from one MySQL Aurora table to 5 different tables in same cluster .
There are transformation of 2 columns is also has to happen .
So when we migrate we need to convert xml to json and then json will be stored in one of the destination table .
We are looking for best way to migrate this data from one MySQL table to another and we are on AWS so we have flexibility to use any services which can help us achieve this .
So far this is what we have planned
MySQL TABLE ----DMS------>S3 ------LAMBDA to convert XML to JSON and create 5 types of files ---->Lambda on file create and Load data local to 5 Different MySQL table .
But one thing we would like to know how can we handle if Load data local fails in between ?So Lambda will submit the query for load data local from s3 to MySQL but how can we track in Lambda that Load data local success or failure ?
We can not use any direct way because we need to transform data in between .
Is there any better way we can use here ?
Can we use Data pipeline in place of Lambda function for load data local?
Or can we use DMS which will upload file from S3 to MySQL ?
Please suggest what can be the best way which will capability to handle failure scenario
What you are basically doing is an ETL process. I would advise you to look into either AWS EMR or AWS Glue. Since you don't seem to have that much experience, I would use Glue.
With Glue you could basically read from MySQL, do the transformation and write back directly to MySQL. Also, since Glue is running Spark in the background, you can leverage it's distributed computing, which will speed up your process instead of using a single thread lambda function.

AWS Neptune Change Management

we are considering using AWS Neptune as graphdb solution.
I am coming from Django world so I used to use db migrations a lot.
I could not find any info about how AWS Neptune does change management on DB?
ie. what happens if I want to reload a backup from a month ago and there has been schema changes since then? How do we track these changes?
Should we write custom scripts?
Unlike something like an RDBMS and some other data stores, Amazon Neptune, and many other graph dbs for that matter, are called "schemaless" meaning there is no need to explicitly define or maintain a schema. The schema is implicitly defined by the data stored in the database. In the case you mentioned, restoring a backup, there is no need for a migration/change script to be run. When you restore the backup the schema will be defined by the restored data.
This "schemaless" nature of the database allows applications to begin adding new entity types and data properties without any sort of ETL process. However, this also means that the application does need to manage some sort of schema internally to maintain sanity over the data being stored (e.g. first_name and firstName could be used and would be separate properties.).

Strategy for Updating Schema/Data of Data Stored in AWS S3

At my organization, we are using a stack of AWS S3, AWS Glue, and Athena to drive some reporting of internal metrics. In general, this stack is great for quick set up for reporting off of raw data (stored in S3). The problem we've come against is what to do if we notice we need to somehow update the data that's already stored in S3. For example, we want to update values in a column that have a certain string to update that value.
Unlike a database, we can't just run a query to update all the existing data. I've tried to see if we can utilize Glue Jobs to accomplish this, but from my limited understanding, it doesn't seem like it's meant to do ETL from a bucket back to the same bucket.
The only thing I can think is to write a custom tool that iterates through an S3 bucket, loads a file, provides the transformation, and puts it back, overwriting the original. It seems there has to be a better way though.
Updates are not handled in a native way in a traditional hive-like warehousing solution, which I deem Athena to be. A common solution is a kind of engineering workaround where you do "insert overwrite" a partition (borrowing Hive syntax, possible in Presto and hopefully also possible in Athena, which is based on Presto).
Other solutions include creating new tables and atomically replacing a view, which users are supposed to query, instead of querying the underlying table(s) directly.
As this is a common problem, there are also some ready to use solutions to it, but I do not know whether which/whether they are possible with Athena. They are certainly possible with Presto (Presto SQL):
Hive ACID transactional tables (updates currently required Hive runtime)
Data Lake (open sourced by Databricks; updates currently require Spark runtime)
Hudi (I know little about this one)

How can i see metadata, lineage of data stored in AWS redshift?

I am using solutions like cloudera navigator, atlas and Wherehows
to get Hadoop, HDFS, HIVE, SQOOP, MAPREDUCE metadata and lineage.
Now we have a data warehouse in AWS redshift as well. Is there a way to extract metadata or lineage or both information out of redshift.
So far i have not found anything on this.
Is there a way to integrate the same to wherehows as a crawled solution?
I found only one post which gives some information about how to get some information from redshift assuming it will be similar to postgresql. I am sure someone would have written some open source solution to this problem.
Or is it just matter of writing a simple single script to extract this information?
I am looking for a enterprise level solution. I hope someone will point me in right direction.
AWS Glue Data catalog is a fully managed metadata management service.It has AWS Glue crawler which automatically crawls through your source(for you its redshift) and creates a centralized metadata repository which can be accessed by other AWS services.
Refer:
https://docs.aws.amazon.com/glue/latest/dg/components-overview.html
https://aws.amazon.com/glue/
You can access metadata by querying the system tables in Redshift:
https://docs.aws.amazon.com/redshift/latest/dg/cm_chap_system-tables.html
The system tables are on the leader node in each cluster (see this guide on the Redshift Architecture that I wrote)
Redshift deletes the content of the system tables on a rolling basis, so you need to store that data in your cluster, or another separate cluster, to get a history. With the data in the system tables, you have a baseline of information about your queries and what tables they are touching.
You can put a dashboard like Kibana or Periscope Data on top of that data to visualize it. Plaid has done a write-up of how they've built an in-house monitoring solution that has some information about data lineage:
https://blog.plaid.com/managing-your-amazon-redshift-performance-how-plaid-uses-periscope-data/
But go get true data lineage, you need to understand how queries relate to your workflows, i.e. for an Airflow DAG. To get that information, you need to "tag" your queries so you can trace them in the context of transformations / workflows, vs. looking at the individual query.
This is something we've built into our product - heads up that it's a commercial solution:
https://www.intermix.io/blog/announcing-query-insights/
Unlike the raw logs from the system tables, we give you the context of what apps / workflows are triggering queries, which users are running them, and what tables they are touching.
Lars

Loading data (incrementally) into Amazon Redshift, S3 vs DynamoDB vs Insert

I have a web app that needs to send reports on its usage, I want to use Amazon RedShift as a data warehouse for that purpose,
How should i collect the data ?
Every time, the user interact with my app, i want to report that.. so when should i write the files to S3 ? and how many ?
What i mean is:
- If do not send the info immediately, then I might lose it as a result of a connection lost, or from some bug in my system while its been collected and get ready to be sent to S3...
- If i do write files to S3 on each user interaction, i will end up with hundreds of files (on each file has minimal data), that need to be managed, sorted, deleted after been copied to RedShift.. that dose not seems like a good solution .
What am i missing? Should i use DynamoDB instead, Should i use simple insert into Redshift instead !?
If i do need to write the data to DynamoDB, should i delete the hold table after been copied .. what are the best practices ?
On any case what are the best practices to avoid data duplication in RedShift ?
Appreciate the help!
It is preferred to aggregate event logs before ingesting them into Amazon Redshift.
The benefits are:
You will use the parallel nature of Redshift better; COPY on a set of larger files in S3 (or from a large DynamoDB table) will be much faster than individual INSERT or COPY of a small file.
You can pre-sort your data (especially if the sorting is based on event time) before loading it into Redshift. This is also improve your load performance and reduce the need for VACUUM of your tables.
You can accumulate your events in several places before aggregating and loading them into Redshift:
Local file to S3 - the most common way is to aggregate your logs on the client/server and every x MB or y minutes upload them to S3. There are many log appenders that are supporting this functionality, and you don't need to make any modifications in the code (for example, FluentD or Log4J). This can be done with container configuration only. The down side is that you risk losing some logs and these local log files can be deleted before the upload.
DynamoDB - as #Swami described, DynamoDB is a very good way to accumulate the events.
Amazon Kinesis - the recently released service is also a good way to stream your events from the various clients and servers to a central location in a fast and reliable way. The events are in order of insertion, which makes it easy to load it later pre-sorted to Redshift. The events are stored in Kinesis for 24 hours, and you can schedule the reading from kinesis and loading to Redshift every hour, for example, for better performance.
Please note that all these services (S3, SQS, DynamoDB and Kinesis) allow you to push the events directly from the end users/devices, without the need to go through a middle web server. This can significantly improve the high availability of your service (how to handle increased load or server failure) and the cost of the system (you only pay for what you use and you don't need to have underutilized servers just for logs).
See for example how you can get temporary security tokens for mobile devices here: http://aws.amazon.com/articles/4611615499399490
Another important set of tools to allow direct interaction with these services are the various SDKs. For example for Java, .NET, JavaScript, iOS and Android.
Regarding the de-duplication requirement; in most of the options above you can do that in the aggregation phase, for example, when you are reading from a Kinesis stream, you can check that you don't have duplications in your events, but analysing a large buffer of events before putting into the data store.
However, you can do this check in Redshift as well. A good practice is to COPY the data into a staging tables and then SELECT INTO a well organized and sorted table.
Another best practice you can implement is to have a daily (or weekly) table partition. Even if you would like to have one big long events table, but the majority of your queries are running on a single day (the last day, for example), you can create a set of tables with similar structure (events_01012014, events_01022014, events_01032014...). Then you can SELECT INTO ... WHERE date = ... to each of this tables. When you want to query the data from multiple days, you can use UNION_ALL.
One option to consider is to create time series tables in DynamoDB where you create a table every day or week in DynamoDB to write every user interaction. At the end of the time period (day, hour or week), you can copy the logs on to Redshift.
For more details, on DynamoDB time series table see this pattern: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.TimeSeriesDataAccessPatterns
and this blog:
http://aws.typepad.com/aws/2012/09/optimizing-provisioned-throughput-in-amazon-dynamodb.html
For Redshift DynamoDB copy: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/RedshiftforDynamoDB.html
Hope this helps.
Though there is already an accepted answer here, AWS launched a new service called Kinesis Firehose which handles the aggregation according to user defined intervals, a temporary upload to s3 and the upload (SAVE) to redshift, retries and error handling, throughput management,etc...
This is probably the easiest and most reliable way to do so.
You can write data to CSV file on local disk and then run Python/boto/psycopg2 script to load data to Amazon Redshift.
In my CSV_Loader_For_Redshift I do just that:
Compress and load data to S3 using boto Python module and multipart upload.
conn = boto.connect_s3(AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY)
bucket = conn.get_bucket(bucket_name)
k = Key(bucket)
k.key = s3_key_name
k.set_contents_from_file(file_handle, cb=progress, num_cb=20,
reduced_redundancy=use_rr )
Use psycopg2 COPY command to append data to Redshift table.
sql="""
copy %s from '%s'
CREDENTIALS 'aws_access_key_id=%s;aws_secret_access_key=%s'
DELIMITER '%s'
FORMAT CSV %s
%s
%s
%s;""" % (opt.to_table, fn, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY,opt.delim,quote,gzip, timeformat, ignoreheader)
Just being a little selfish here and describing exactly what Snowplow ,an event analytics platform does. They use this awesome unique way of collecting event logs from the client and aggregating it on S3.
They use Cloudfront for this. What you can do is, host a pixel in one of the S3 buckets and put that bucket behind a CloudFront distribution as an origin. Enable logs to an S3 bucket for the same CloudFront.
You can send logs as url parameters whenever you call that pixel on your client (similar to google analytics). These logs can then be enriched and added to Redshift database using Copy.
This solves the purpose of aggregation of logs. This setup will handle all of that for you.
You can also look into Piwik which is an open source analytics service and see if you can modify it specific to your needs.