Data Pipeline - Dynamo DB export - amazon-web-services

I have a DynamoDB table with millions of records. I have created a global secondary index (GSI) on my filter criteria and filter products through it. Now I want to use AWS Data Pipeline to query products from the table and export them to S3.
Questions:
a) Can we specify a GSI name in the pipeline? Querying the large table through Data Pipeline gets cancelled because of a timeout. [The pipeline configuration has a maximum wait time of 6 hours; the job reaches that limit and is cancelled.]
b) Is there a better way to quickly create an export dump from the table using the GSI?
Please share your views.
Regards,
Kishore

You cannot specify the GSI in the pipeline. The list of available options you can specify for a DynamoDB node is given here. The Data Pipeline service actually creates an EMR cluster for the export job, which uses parallel table scans. You could try using a larger instance size for your nodes to speed up the process.
Since your table has millions of records, make sure you have provisioned enough read throughput. Even if your provisioned throughput is high, the speed of the export depends on what percentage of the provisioned throughput is allocated to the export job. This is described in the AWS Data Pipeline documentation here.
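Since Data Pipeline cannot target the index, one alternative outside Data Pipeline is to query the GSI directly and write the results yourself. The following is only a minimal sketch, not the Data Pipeline approach: it assumes `client` behaves like a boto3 DynamoDB client (only `query` is called), that the index key is a string attribute, and that `out` is any writable text stream. The function name and all table, index, and attribute names are hypothetical.

```python
import json


def export_gsi_partition(client, table_name, index_name, key_name, key_value, out):
    """Page through one GSI partition and write each item as a JSON line.

    `client` is assumed to behave like a boto3 DynamoDB client; only its
    `query` method is used. Returns the number of items written.
    """
    params = {
        "TableName": table_name,
        "IndexName": index_name,
        "KeyConditionExpression": "#k = :v",
        "ExpressionAttributeNames": {"#k": key_name},
        # Assumption: the index partition key is a string ("S") attribute.
        "ExpressionAttributeValues": {":v": {"S": key_value}},
    }
    written = 0
    while True:
        page = client.query(**params)
        for item in page.get("Items", []):
            out.write(json.dumps(item) + "\n")
            written += 1
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            # No LastEvaluatedKey means this was the final page.
            return written
        params["ExclusiveStartKey"] = last_key
```

Each query still consumes read capacity on the index, so the throughput advice above applies here too. Items containing binary attributes would need extra handling before `json.dumps`.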

Related

ETL Glue delta or incremental logic

I am working on a project where we need an incremental load on a daily basis, and we are using Glue for the ETL. We are getting duplicates: the data is doubled after the Glue run.
Pipeline flow: ingestion zone, raw zone, curated zone, consumption zone.
History: 1000 records. Below dates on updates and inserts
At the end of the Jan-11 run, I would like to see a total of 1100 records, as I'm upserting the data in the raw-to-curated step. However, I'm getting doubled-up records in the curated zone. The data is partitioned by run date, like 2020/01/10/data.csv and 2020/01/11/data.csv.
What changes should I make so that only the delta (incremental) records are seen in the consumption zone?

As I understand the problem statement: the Glue job bookmarks feature is used along with the metadata catalog tables to ensure only new data is processed.
A few questions:
Is your curated zone built on top of S3 or on an RDS service?
Are these direct updates or an SCD-2 data transformation?
Have you by any chance reset, paused, or disabled the job bookmarks?
You say the data is partitioned by run date; does that partitioning apply to the ingestion layer [multiple date-specific folders under one S3 bucket, with data kept in Parquet format] or to the target curated layer?
If that does not solve your problem, I would recommend writing custom Spark code in either PySpark or Scala that encapsulates your processing logic.
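The upsert that such custom code needs to perform can be illustrated in plain Python. This is only a sketch of the logic, not Glue or Spark code: it assumes every record carries a unique key column (here a hypothetical `id`) and that a delta row should replace the curated row with the same key.

```python
def upsert(curated, delta, key="id"):
    """Merge delta rows into curated rows, keeping one row per key.

    Rows are dicts. A delta row replaces the curated row with the same key;
    a delta row with a new key is added.
    """
    merged = {row[key]: row for row in curated}
    for row in delta:
        merged[row[key]] = row
    return list(merged.values())


# Hypothetical data: 1000 history rows, then a delta with 50 updates and 100 inserts.
history = [{"id": i, "value": "old"} for i in range(1000)]
delta = [{"id": i, "value": "new"} for i in range(950, 1100)]
result = upsert(history, delta)
```

Appending the delta to the history without matching on the key is what produces doubled-up rows; matching on the key keeps one row per record.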

AWS update Athena meta: Glue Crawler vs MSCK Repair Table

When a new partition is added to an Athena table, we can use either a Glue Crawler or MSCK REPAIR TABLE to update the metadata. What do they cost? Which one is preferred?

The MSCK REPAIR TABLE command requires your S3 key to include the partition scheme, as documented here. If your S3 key does not include the partition scheme, the MSCK REPAIR TABLE command will report the missing partitions, but you will still have to add them yourself. Another difference is that the MSCK REPAIR TABLE command can time out after 30 minutes (the default Athena query time limit), while a Glue crawler will not.
Here is pricing information:
Glue Crawler:
There is an hourly rate for AWS Glue crawler runtime to discover data and populate the AWS Glue Data Catalog. You are charged an hourly rate based on the number of Data Processing Units (or DPUs) used to run your crawler. A single Data Processing Unit (DPU) provides 4 vCPU and 16 GB of memory. You are billed in increments of 1 second, rounded up to the nearest second, with a 10-minute minimum duration for each crawl. Use of AWS Glue crawlers is optional, and you can populate the AWS Glue Data Catalog directly through the API.
Pricing
For all AWS Regions where AWS Glue is available:
$0.44 per DPU-Hour, billed per second, with a 10-minute minimum per crawler run
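As a rough worked example of the pricing quoted above, the cost of one crawl can be estimated as follows. The rate and the 10-minute minimum come from the quoted text; the DPU count and runtimes in the usage lines are hypothetical.

```python
def crawler_cost(dpus, runtime_seconds, rate_per_dpu_hour=0.44, minimum_seconds=600):
    """Estimate the cost in USD of one Glue crawler run.

    Billing is per second, but each run is charged for at least
    `minimum_seconds` (10 minutes in the pricing quoted above).
    """
    billed_seconds = max(runtime_seconds, minimum_seconds)
    return dpus * billed_seconds / 3600 * rate_per_dpu_hour


# Hypothetical runs on 2 DPUs: a 2-minute crawl is billed as 10 minutes.
short_run = crawler_cost(dpus=2, runtime_seconds=120)
long_run = crawler_cost(dpus=2, runtime_seconds=3600)
```

Under these assumptions the short run costs about $0.15 and the one-hour run $0.88, so frequent short crawls are dominated by the minimum charge.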
Athena:
There are no charges for Data Definition Language (DDL) statements like CREATE/ALTER/DROP TABLE, statements for managing partitions, or failed queries.
However, on top of both of these commands you will still incur S3 costs. Reference: AWS Athena: does `msck repair table` incur costs?
My opinion is it is best to manage the partition yourself if you are able to, after adding new data.
ALTER TABLE database.table ADD
PARTITION (partition_name='PartitionValue') LOCATION 's3://bucket/path/partition'
If forced to use Glue or Athena, I would evaluate which way will fit better into your process. The MSCK REPAIR TABLE command may be easier to manage but you may run into trouble if you have a lot of data in partitions or they are not partitioned correctly. Also, you will have to have a way to automate running the command. Glue Crawlers can be configured with triggers.
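If you manage partitions yourself, the statement can be generated by whatever job lands the new data. A minimal sketch, assuming simple partition values without quotes; the function name and all database, table, column, and bucket names are hypothetical. It uses `ADD IF NOT EXISTS` so that rerunning it for an existing partition is not an error.

```python
def add_partition_sql(database, table, partition, location):
    """Build an Athena ALTER TABLE ... ADD PARTITION statement.

    `partition` maps partition column names to values. Values containing a
    single quote are rejected rather than escaped, to keep the sketch simple.
    """
    parts = []
    for column, value in partition.items():
        value = str(value)
        if "'" in value:
            raise ValueError("partition values must not contain quotes")
        parts.append("{} = '{}'".format(column, value))
    return "ALTER TABLE {}.{} ADD IF NOT EXISTS PARTITION ({}) LOCATION '{}'".format(
        database, table, ", ".join(parts), location
    )


statement = add_partition_sql(
    "mydb", "events", {"run_date": "2020-01-11"}, "s3://bucket/path/2020/01/11/"
)
```

The resulting string can then be submitted as an ordinary Athena query.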

I agree with adding partitions manually. You can do this via an Athena query (ALTER TABLE ... ADD PARTITION () ...) as in the answer from @KiteCoder, or you can do this via the Glue API directly.
Calling the Glue API is more verbose, but also more 'structured'. Calling Athena is obviously a SQL query, and I know how many people despise writing code that dynamically generates SQL.
The specific operation is CreatePartition. It does require an object called StorageDescriptor which defines all the columns and data types in that table, but for an existing table you can retrieve that structure from the GetTable operation.
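The GetTable-then-CreatePartition flow described above might look like the following. This is a sketch under assumptions: `glue` is expected to behave like a boto3 Glue client (only `get_table` and `create_partition` are called), the function name is hypothetical, and the partition is assumed to use the same columns and format as the table.

```python
import copy


def create_partition_like_table(glue, database, table, values, location):
    """Register one partition, reusing the table's StorageDescriptor.

    `glue` is assumed to behave like a boto3 Glue client. `values` lists the
    partition values in the order of the table's partition keys.
    """
    table_def = glue.get_table(DatabaseName=database, Name=table)["Table"]
    # Copy the descriptor so the fetched table definition is left untouched.
    descriptor = copy.deepcopy(table_def["StorageDescriptor"])
    descriptor["Location"] = location  # point the partition at its own prefix
    partition_input = {"Values": list(values), "StorageDescriptor": descriptor}
    glue.create_partition(
        DatabaseName=database, TableName=table, PartitionInput=partition_input
    )
    return partition_input
```

Passing the client in as a parameter keeps the function easy to exercise with a stub before pointing it at a real catalog.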

Lambda architecture on AWS: choose database for batch layer

We're building a Lambda architecture on the AWS stack. A lack of DevOps knowledge forces us to prefer AWS managed solutions over custom deployments.
Our workflow:
[Batch layer]
Kinesis Firehose -> S3 -Glue-> EMR (Spark) -Glue-> S3 views -----+
                                                                 |===> Serving layer (ECS) => Users
Kinesis -> EMR (Spark Streaming) -> DynamoDB/ElastiCache views --+
[Speed layer]
We are already using 3 datastores: ElastiCache, DynamoDB and S3 (queried with Athena). The batch layer produces from 500,000 up to 6,000,000 rows each hour. Only the last hour's results need to be queried by the serving layer, with low-latency random reads.
None of our databases fits the batch-insert and random-read requirements. DynamoDB does not fit batch inserts: it is too expensive because of the throughput they require. Athena is MPP and moreover has a limit of 20 concurrent queries. ElastiCache is used by the streaming layer, and we are not sure it is a good idea to perform batch inserts there.
Should we introduce a fourth storage solution or stay with the existing ones?
Considered options:
Persist batch output to DynamoDB and ElastiCache (the part of the data that is updated rarely and can be compressed/aggregated goes to DynamoDB; frequently updated data, ~8 GB/day, goes to ElastiCache).
Introduce another database (HBase on EMR over S3 / Amazon Redshift?) as a solution.
Use S3 Select over Parquet to overcome the Athena concurrent query limit. That will also reduce query latency. But does S3 Select have any concurrent query limits? I can't find any related info.
The first option is bad because of batch inserts into the ElastiCache used by streaming. Also, does keeping batch and speed layer views in the same data stores follow the Lambda architecture?
The second solution is bad because it adds a fourth data store, isn't it?

In this case you might want to use something like HBase or Druid; not only can they handle batch inserts and very low latency random reads, they could even replace the DynamoDB/ElastiCache component from your solution, since you can write directly to them from the incoming stream (to a different table).
Druid is probably superior for this, but as per your requirements, you'll want HBase, as it is available on EMR with the Amazon Hadoop distribution, whereas Druid doesn't come in a managed offering.

DynamoDB to Redshift

I am asking this in the context of loading data from DynamoDB into Redshift. Per the Redshift docs:
To avoid consuming excessive amounts of provisioned read throughput, we recommend that you not load data from Amazon DynamoDB tables that are in production environments.
My data is in Production, so how do I get it out of there?
Alternatively, is DynamoDB Streams a better overall choice for moving data from DynamoDB into Redshift? (I understand this does not add to my RCU cost.)

The warning is due to the fact that the export could consume much of your read capacity for a period of time, which would impact your production environment.
Some options:
Do it at night when you don't need as much capacity
Set READRATIO to a low value so that it consumes less of the capacity
Temporarily increase the Read Capacity Units of the table when performing the export (you can decrease capacity four times a day)
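To illustrate the READRATIO option, here is a sketch that builds a Redshift COPY statement for a DynamoDB source. READRATIO is the percentage of the table's provisioned read throughput the load may use; the 1 to 200 range check reflects my reading of the Redshift documentation, and the function name, table names, and role ARN are hypothetical.

```python
def dynamodb_copy_sql(target_table, dynamodb_table, iam_role_arn, read_ratio):
    """Build a Redshift COPY statement that loads from a DynamoDB table.

    `read_ratio` is the percentage of the DynamoDB table's provisioned read
    throughput that the load is allowed to consume.
    """
    if not 1 <= read_ratio <= 200:
        raise ValueError("READRATIO must be between 1 and 200")
    return "COPY {} FROM 'dynamodb://{}' IAM_ROLE '{}' READRATIO {};".format(
        target_table, dynamodb_table, iam_role_arn, read_ratio
    )


# Hypothetical names; a low ratio leaves most of the capacity for production traffic.
statement = dynamodb_copy_sql(
    "products", "Products", "arn:aws:iam::123456789012:role/MyRedshiftRole", 10
)
```

A lower ratio makes the load slower but reduces the chance of throttling production reads.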
DynamoDB Streams provides a stream of data representing changes to a DynamoDB table. You would need to process these streams using AWS Lambda to send the data somewhere for loading into Redshift. For example, you could populate another DynamoDB table and use it for importing into Redshift. Or, you could write the data to Amazon S3 and import from there into Redshift. However, this involves lots of moving parts.

Using AWS Data Pipeline, you can bulk copy data from DynamoDB to a new or existing Redshift table.

Copying only new records from AWS DynamoDB to AWS Redshift

I see there are tons of examples and documentation on copying data from DynamoDB to Redshift, but we are looking at an incremental copy process where only the new rows are copied from DynamoDB to Redshift. We will run this copy process every day, so there is no need to wipe the entire Redshift table each day. Does anybody have any experience or thoughts on this topic?

DynamoDB has a feature (currently in preview) called Streams:
Amazon DynamoDB Streams maintains a time ordered sequence of item level changes in any DynamoDB table in a log for a duration of 24 hours. Using the Streams APIs, developers can query the updates, receive the item level data before and after the changes, and use it to build creative extensions to their applications built on top of DynamoDB.
This feature will allow you to process new updates as they come in and do what you want with them, rather than design an exporting system on top of DynamoDB.
You can see more information about how the processing works in the Reading and Processing DynamoDB Streams documentation.
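As a sketch of what processing those updates could look like, the function below splits a DynamoDB Streams event (in the shape a Lambda trigger receives) into new or changed items and deleted keys. It assumes the stream is configured to include new images; the function name is hypothetical, and what happens to the two lists afterwards (for example staging them in S3 for a Redshift load) is left out.

```python
def changed_items(event):
    """Split a DynamoDB Streams event into upserted items and deleted keys.

    Returns (upserts, deletes): `upserts` holds the new image of every
    inserted or modified item, `deletes` holds the key of every removed item.
    """
    upserts, deletes = [], []
    for record in event.get("Records", []):
        change = record["dynamodb"]
        if record["eventName"] == "REMOVE":
            deletes.append(change["Keys"])
        elif "NewImage" in change:
            # INSERT or MODIFY; NewImage is present only when the stream
            # view type includes new images.
            upserts.append(change["NewImage"])
    return upserts, deletes
```

Both lists keep DynamoDB's typed attribute format (for example `{"S": "abc"}`), so a flattening step is usually needed before loading into a relational table.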

The Redshift COPY command can only copy the entire table. There are several ways to achieve an incremental copy:
Using an AWS EMR cluster and Hive - If you set up an EMR cluster, you can use Hive tables to execute queries on the DynamoDB data and move the results to S3. That data can then easily be moved to Redshift.
You can store your DynamoDB data based on access patterns (see http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.TimeSeriesDataAccessPatterns). If you store the data this way, the DynamoDB tables can be dropped after they are copied to Redshift.

This can be solved with a secondary DynamoDB table that tracks only the keys that have changed since the last backup. This table has to be updated wherever the main DynamoDB table is updated (add, update, delete). You then delete the tracked keys either at the end of the backup process or one by one as each row is backed up.

If your DynamoDB table can have
Timestamps as an attribute or
A binary flag which conveys data freshness as attribute
then you can write a Hive query to export only the current day's data or the fresh data to S3, and then copy this incremental S3 data to Redshift with 'KEEP_EXISTING'.
