I have a Glue job that reads from a Kinesis stream and stores the data in a governed table in Lake Formation.
The issue is that I end up with duplicated data. Let's say a user changed their address: now I have two records for that user when I only want the latest one.
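For illustration, the result I'm after looks roughly like this in PySpark (user_id and updated_at are made-up column names, and df would be the micro-batch DataFrame coming from the stream):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Keep only the most recent record per user within the data at hand.
    w = Window.partitionBy("user_id").orderBy(F.col("updated_at").desc())
    latest_only = (
        df.withColumn("rn", F.row_number().over(w))
          .filter(F.col("rn") == 1)
          .drop("rn")
    )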
How can I achieve this?
Thanks
I'm looking at the diagram here, and from my understanding, a DynamoDB stream going into a Redshift table via Kinesis Firehose will send the updates as Redshift commands to the table (i.e. insert, update, etc.). So this keeps a Redshift version of a DynamoDB table in sync.
But how do you deal with the historical data? Is there a good process for filling the Redshift table with the data to date, which can then be kept in sync via a DynamoDB stream? It's not trivial, because depending on the timing, some updates may be lost if I manually copy the data into a Redshift table and then switch on a DynamoDB stream.
So regarding the diagram, it shows Kinesis Firehose delivering data to S3, queryable by Athena. I feel like I'm missing something, because if the data going to S3 is only updates and new records, it doesn't seem like something that works well for Athena (a partitioned snapshot of the entire table makes more sense).
So if I have a DynamoDB table that is currently receiving data, and I want to create a new Redshift table that contains all the same data up to a given time and then receives all subsequent updates via a DynamoDB stream, how do I go about doing that?
I am using the BigQuery Data Transfer Service to migrate all data from Redshift to BigQuery.
After that, I want to perform backfilling for a specific time range if any data is missing, but I don't see any backfill option in the transfer job.
How can I achieve that in BigQuery?
Reading your question in light of your comments, I would proceed differently from what you describe. You reach the same goal, however :).
Using your ETL pipeline, the first step would be to accumulate raw data in a datalake.
Let's take a storage service like S3 to do so. For this ETL pipeline, S3 is your datasink.
Note that your pipeline does nothing more than take raw data from A and put it into S3. Also, the location in S3 should be under a timestamped folder, per day for instance (e.g. yyyymmdd), so that you can sort and consume your data along the time dimension.
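For instance, a minimal sketch of that landing step with boto3 (the bucket name and key layout are assumptions; your ETL tool may handle this part for you):

    import json
    from datetime import datetime, timezone

    import boto3

    s3 = boto3.client("s3")

    def land_raw_batch(records, bucket="my-datalake"):
        """Write one batch of raw records under a per-day (yyyymmdd) prefix."""
        now = datetime.now(timezone.utc)
        key = f"raw/{now:%Y%m%d}/batch-{now:%Y%m%dT%H%M%S%f}.json"
        body = "\n".join(json.dumps(r) for r in records)  # newline-delimited JSON
        s3.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
        return key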
Obviously, the data considered here is newer than what you already have in Redshift.
It may also have a different structure from what you already put in Redshift, due to any transformations you set in your initial pipeline.
In case you put raw data directly into Redshift, just export that data into the same S3 bucket under the prefix legacy/*. (If it is transformed, then you have to put a second S3 datasink in your pipeline after this intermediary transformation and keep the same S3 naming strategy.)
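One way to do that export is an UNLOAD into the same bucket; a rough sketch, with placeholder cluster endpoint, credentials, table, IAM role and bucket names:

    import psycopg2

    # Placeholder connection details; the IAM role must be attached to the
    # cluster and allowed to write to the bucket.
    conn = psycopg2.connect(
        host="my-cluster.example.eu-west-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="etl_user", password="***",
    )

    unload_sql = """
        UNLOAD ('SELECT * FROM public.my_table')
        TO 's3://my-datalake/legacy/my_table_'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload-role'
        FORMAT AS PARQUET;
    """

    with conn, conn.cursor() as cur:
        cur.execute(unload_sql)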
Let's take a break to understand what we have. We filled an S3 bucket with raw data that we can now replay at will for a specific day, using cron or an orchestration tool such as Apache Airflow. Moreover, you can freely modify the content of each timestamped folder in case you missed data, and then replay the following pipelines: that is the backfill you want.
Speaking of which, S3 would act as the data source for these following pipelines, which apply the wanted transformations to the raw data from S3 and choose BigQuery (and potentially Redshift) as the datasink. Now please take into consideration the price of these operations. The streaming API in BQ is expensive, as high as $0.50 per GB. Do that only if you need real-time results. If you can afford a latency of more than 5 minutes, a better strategy would be to set GCS as the datasink of your ETL and transfer the data from there into BQ (keeping the same yyyymmdd file naming pattern to enable potential backfills). This transfer is free if the GCS bucket and the BQ dataset are in the same region. You could trigger the transfer with GCS events, for instance by triggering a Cloud Function on blob creation that puts the data into BQ.
Last but not least, backfilling should be done wisely, especially in BQ, where row-level updates or inserts are not performant and are an open door to duplication. What you should consider is BigQuery partitioning, which you can set on a column that contains a timestamp, or on a hidden one if your data contains none. Which timestamp? Well, the one set in the GCS folder name!
Once again, you can modify the data in your GCS bucket for a given day and replay the transfer into BQ.
But each transfer for a given day must overwrite the partition that the considered data belongs to (e.g. the data under 20200914 would overwrite the associated partition in BQ). Doing so, we abide by the concept of a pure task, which is a guarantee of idempotency and non-duplication.
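A sketch of what that event-driven, partition-overwriting load could look like as a Cloud Function (1st-gen background function signature; the bucket layout, dataset and table names are assumptions):

    from google.cloud import bigquery

    client = bigquery.Client()
    TABLE = "my_project.my_dataset.events"   # placeholder

    def gcs_to_bq(event, context):
        """Triggered on object creation in the GCS bucket (yyyymmdd/... layout)."""
        day = event["name"].split("/")[0]                 # e.g. "20200914"
        uri = f"gs://{event['bucket']}/{event['name']}"
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
            autodetect=True,
            write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
            time_partitioning=bigquery.TimePartitioning(
                type_=bigquery.TimePartitioningType.DAY
            ),
        )
        # The $day decorator targets a single partition, so a replay overwrites
        # that partition instead of duplicating rows.
        client.load_table_from_uri(
            uri, f"{TABLE}${day}", job_config=job_config
        ).result()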
Please read this article for more insights.
Note: If you intend to get rid of Redshift, you can choose to do it directly and forget about S3 as the datasink of your first ETL. Choose GCS directly (ingress is free) and migrate your existing Redshift data into GCS, using S3 as an intermediary and the Google Storage Transfer Service from S3 to GCS.
I have a table with about 6 million records and want to start archiving records. I have thought of creating a backup version of the same table and moving records across once they meet the criteria for being archived. However, I have been told that it is also possible to use Hive to copy this data to an S3 bucket.
Could someone please explain why I would opt to copy the data into an S3 bucket rather than store it in another DynamoDB table?
DynamoDB has a time-to-live mechanism, and you can stream the record deletions to an AWS Lambda function that puts the data into S3. Check this detailed guide on how to set it up. Also, you can try out AWS Data Pipeline with an EMR cluster, which is a common way to set up one-time or periodic migrations.
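A rough sketch of the Lambda side of that TTL-plus-stream approach (the bucket name is a placeholder, and the stream must be configured to include old images):

    import json

    import boto3

    s3 = boto3.client("s3")
    ARCHIVE_BUCKET = "my-archive-bucket"  # placeholder

    def handler(event, context):
        """Invoked by the DynamoDB stream; archives items deleted by TTL to S3."""
        for record in event["Records"]:
            if record["eventName"] != "REMOVE":
                continue
            # TTL deletions are issued by the DynamoDB service principal.
            identity = record.get("userIdentity", {})
            if identity.get("principalId") != "dynamodb.amazonaws.com":
                continue
            old_item = record["dynamodb"]["OldImage"]   # DynamoDB-typed JSON
            key = f"archive/{record['eventID']}.json"
            s3.put_object(Bucket=ARCHIVE_BUCKET, Key=key,
                          Body=json.dumps(old_item).encode("utf-8"))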
If you actively use full-scan operations over your DynamoDB table, then it's better to archive and remove the records you don't use. If you query records only by the primary key, then most probably archiving isn't worth the effort. You can check your bill, but the first 25 GB stored in DynamoDB are free.
I have data coming into Kinesis Streams from different producers. The data needs to be transformed into several different tables depending on the type of the data. I'm trying to have a Lambda function read from Kinesis Streams, transform the data for the different tables, and then use Kinesis Firehose to batch and COPY it into Redshift.
However, since Firehose can only deliver to one Redshift table at a time, I need as many Firehose delivery streams as there are tables in my Redshift schema. I see that there is a default cap of 20 Firehose delivery streams per region, and I'm wondering if this is an ideal data pipeline for what I am trying to achieve.
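For reference, this is roughly what I have in mind for the Lambda (the type field and the delivery stream names are made up):

    import base64
    import json

    import boto3

    firehose = boto3.client("firehose")

    # Hypothetical mapping from record type to the delivery stream feeding that table.
    STREAM_BY_TYPE = {
        "order": "firehose-orders",
        "user": "firehose-users",
    }

    def handler(event, context):
        batches = {}
        for record in event["Records"]:
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            row = payload  # per-table transformation would go here
            stream = STREAM_BY_TYPE[payload["type"]]
            batches.setdefault(stream, []).append(
                {"Data": (json.dumps(row) + "\n").encode("utf-8")}
            )
        for stream, records in batches.items():
            # PutRecordBatch accepts at most 500 records per call.
            firehose.put_record_batch(DeliveryStreamName=stream, Records=records)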
I have designed a data pipeline almost like yours. I get the data from Kinesis and transform it after it has reached Redshift: all the data from one stream goes into one master table in Redshift. Then I run a cron job which transforms the data into the different tables.
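A simplified sketch of that cron step, assuming psycopg2 and made-up table and column names (a real job should handle the "processed" bookkeeping more carefully):

    import psycopg2

    # Placeholder connection details and table/column names.
    conn = psycopg2.connect(
        host="my-cluster.example.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="etl_user", password="***",
    )

    TRANSFORMS = [
        # Fan each event type out of the master table into its own table.
        """INSERT INTO orders (event_id, payload)
           SELECT event_id, payload FROM master_events
           WHERE event_type = 'order' AND processed = false;""",
        """INSERT INTO users (event_id, payload)
           SELECT event_id, payload FROM master_events
           WHERE event_type = 'user' AND processed = false;""",
        """UPDATE master_events SET processed = true WHERE processed = false;""",
    ]

    with conn, conn.cursor() as cur:   # one transaction: all statements commit together
        for stmt in TRANSFORMS:
            cur.execute(stmt)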
If I’ve made a bad assumption please comment and I’ll refocus my answer.
I see there are tons of examples and documentation for copying data from DynamoDB to Redshift, but we are looking for an incremental copy process where only the new rows are copied from DynamoDB to Redshift. We will run this copy process every day, so there is no need to drop and reload the entire Redshift table each day. Does anybody have any experience or thoughts on this topic?
DynamoDB has a feature (currently in preview) called Streams:
Amazon DynamoDB Streams maintains a time ordered sequence of item level changes in any DynamoDB table in a log for a duration of 24 hours. Using the Streams APIs, developers can query the updates, receive the item level data before and after the changes, and use it to build creative extensions to their applications built on top of DynamoDB.
This feature will allow you to process new updates as they come in and do what you want with them, rather than design an exporting system on top of DynamoDB.
You can see more information about how the processing works in the Reading and Processing DynamoDB Streams documentation.
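A minimal sketch of processing one batch of stream records (the record shape is the same whether you poll GetRecords yourself or attach a Lambda trigger; what you do with the new image is up to you):

    import json

    def process_records(records):
        """Handle one batch of DynamoDB stream records."""
        for record in records:
            if record["eventName"] in ("INSERT", "MODIFY"):
                new_image = record["dynamodb"]["NewImage"]  # DynamoDB-typed attribute values
                # Forward the change wherever you need it (S3, a staging table, etc.).
                print(json.dumps(new_image))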
The Redshift COPY from DynamoDB can only copy the entire table. There are several ways to achieve an incremental copy:
Using an AWS EMR cluster and Hive: if you set up an EMR cluster, you can use Hive tables to run queries over the DynamoDB data and move it to S3, from where the data can easily be loaded into Redshift (a sketch of the Hive side follows below).
You can store your DynamoDB data in time-series tables based on access patterns (see http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.TimeSeriesDataAccessPatterns). If you store the data this way, each period's DynamoDB table can be dropped after it has been copied to Redshift.
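A rough sketch of the Hive side of the first option, with placeholder table names, column mapping and S3 location (run on the EMR master node, e.g. via hive -e):

    import subprocess

    # Placeholder table names, column mapping and S3 location.
    HIVE_SCRIPT = """
    CREATE EXTERNAL TABLE ddb_orders (id string, payload string)
    STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
    TBLPROPERTIES ("dynamodb.table.name" = "Orders",
                   "dynamodb.column.mapping" = "id:id,payload:payload");

    CREATE EXTERNAL TABLE s3_orders (id string, payload string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION 's3://my-archive-bucket/orders/';

    INSERT OVERWRITE TABLE s3_orders SELECT * FROM ddb_orders;
    """

    subprocess.run(["hive", "-e", HIVE_SCRIPT], check=True)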
This can be solved with a secondary DynamoDB table that tracks only the keys that were changed since the last backup. This table has to be updated whenever the initial DynamoDB table is updated (add, update, delete). At the end of the backup process you delete the tracked keys, either all at once or one by one right after you back up each row.
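A minimal sketch of that double write, with made-up table and key names (you could also populate the tracking table from the main table's stream instead):

    import boto3

    dynamodb = boto3.resource("dynamodb")
    main_table = dynamodb.Table("Users")              # placeholder names
    changed_keys = dynamodb.Table("UsersChangedKeys")

    def save_user(item):
        """Write the item and record its key for the next incremental backup."""
        main_table.put_item(Item=item)
        changed_keys.put_item(Item={"user_id": item["user_id"]})

    def clear_backed_up_key(user_id):
        """Call after the row has been backed up (or clear all keys at the end)."""
        changed_keys.delete_item(Key={"user_id": user_id})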
If your DynamoDB table has either
a timestamp as an attribute, or
a binary flag attribute which conveys data freshness,
then you can write a Hive query to export only the current day's (or fresh) data to S3, and then do a 'KEEP_EXISTING' copy of this incremental S3 data to Redshift (a sketch of such a query follows below).
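For example, with the freshness-flag variant, the export query could be as simple as the following (table and column names are placeholders and assume Hive external tables over the DynamoDB table and the S3 location, as in the EMR answer above); the resulting S3 prefix is then copied to Redshift in 'KEEP_EXISTING' mode as described:

    # Passed to Hive (e.g. hive -e); only rows flagged as fresh are exported.
    INCREMENTAL_EXPORT = """
    INSERT OVERWRITE TABLE s3_orders_incremental
    SELECT id, payload FROM ddb_orders
    WHERE is_fresh = true;
    """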