We have roughly 60 tables whose changes we would like to capture (CDC) using GCP's Datastream and ingest into our data lake.
Are there any drawbacks to using Datastream? And should I set up one stream that ingests all tables, or should I create a stream per group of tables (or even per table), so that if a stream fails, the impact is localized to a few specific tables?
Thanks in advance.
Related
We have source data in an on-premises SQL Server. We are using AWS Glue to fetch data from SQL Server and load it into S3. Could anyone please advise how we can implement change data capture in AWS Glue?
Note: we don't want to use AWS DMS.
You can leverage AWS DMS for CDC and then use the Apache Iceberg connector with the Glue Data Catalog to achieve this:
https://aws.amazon.com/blogs/big-data/implement-a-cdc-based-upsert-in-a-data-lake-using-apache-iceberg-and-aws-glue/
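For illustration, the upsert step in that post boils down to an Iceberg MERGE INTO run from the Glue Spark job. A rough sketch follows; the catalog, database, table, and column names (id, status, updated_at, op) are hypothetical, and the job is assumed to already be configured for Iceberg with the Glue Data Catalog via job parameters as described in the post.

```python
# Rough sketch of a CDC upsert into an Apache Iceberg table from a Glue Spark job.
# Catalog/database/table and column names are hypothetical; Iceberg + Glue Data
# Catalog configuration is assumed to be supplied via job parameters.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Latest batch of CDC records (e.g. DMS output landed on S3); 'op' marks
# insert/update/delete.
spark.read.parquet("s3://my-bucket/cdc/orders/").createOrReplaceTempView("orders_changes")

# Iceberg supports MERGE INTO in Spark SQL, so inserts, updates and deletes
# can be applied in a single statement.
spark.sql("""
    MERGE INTO glue_catalog.analytics.orders AS t
    USING orders_changes AS s
    ON t.id = s.id
    WHEN MATCHED AND s.op = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET t.status = s.status, t.updated_at = s.updated_at
    WHEN NOT MATCHED AND s.op <> 'D' THEN INSERT (id, status, updated_at)
        VALUES (s.id, s.status, s.updated_at)
""")
```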
I'm only aware of Glue Bookmarks. They will help you with new records (inserts), but won't help you with the updates and deletes that you typically get from a true CDC solution.
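For reference, a minimal Glue job using bookmarks might look like the sketch below; the database and table names are hypothetical, and bookmarks themselves have to be enabled on the job.

```python
# Minimal Glue job sketch with job bookmarks: when bookmarks are enabled on the
# job, only rows not seen by previous runs are returned (inserts only; updates
# and deletes are not detected). Database/table names are hypothetical.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# transformation_ctx ties this source to the bookmark state; for JDBC sources
# the bookmark tracks a sequentially increasing key column (by default the
# primary key).
source = glue_context.create_dynamic_frame.from_catalog(
    database="sqlserver_mirror",
    table_name="dbo_orders",
    transformation_ctx="orders_source",
)

glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/raw/orders/"},
    format="parquet",
    transformation_ctx="orders_sink",
)

job.commit()  # commits the bookmark state for the next run
```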
Not sure of your use case, but you could check out the following project. It has a pretty efficient diff feature and, with the right options, can give you a CDC-like output:
https://github.com/G-Research/spark-extension/blob/master/DIFF.md
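To make the idea concrete, the sketch below is not the spark-extension API itself, just the snapshot-diff approach it implements, written in plain PySpark; the S3 paths and the "id" key column are hypothetical.

```python
# Snapshot diff in plain PySpark: compare yesterday's full extract with today's
# and derive insert/update/delete markers, which is the CDC-like output the
# spark-extension diff feature produces.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

old = spark.read.parquet("s3://my-bucket/snapshots/orders/2024-05-01/").alias("old")
new = spark.read.parquet("s3://my-bucket/snapshots/orders/2024-05-02/").alias("new")

# Any non-key column differing marks the row as changed.
non_key_cols = [c for c in new.columns if c != "id"]
changed = None
for c in non_key_cols:
    neq = ~F.col(f"new.{c}").eqNullSafe(F.col(f"old.{c}"))
    changed = neq if changed is None else (changed | neq)

diff = (
    old.join(new, F.col("old.id") == F.col("new.id"), "full_outer")
    .withColumn(
        "change",
        F.when(F.col("old.id").isNull(), F.lit("I"))   # only in new snapshot: insert
         .when(F.col("new.id").isNull(), F.lit("D"))   # only in old snapshot: delete
         .when(changed, F.lit("C"))                    # values differ: update
         .otherwise(F.lit("N")),                       # unchanged
    )
)

# Keep only the CDC-like rows.
diff.filter(F.col("change") != "N").show()
```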
It's not possible to implement change data capture through direct Glue data extraction. While a job bookmark can help you identify inserts and updates if your table contains an update_at timestamp column, it won't cover delete cases. You actually need a CDC solution.
While a direct AWS Glue connection to a database source is a great solution, I strongly discourage using it for incremental data extraction due to the cost implications. It's like using a truck to ship a single bottle of water.
As you already noted, I'm not a fan of AWS DMS either, but for a robust CDC solution, a tool like Debezium could be a perfect fit. It integrates with Kafka and Kinesis, and you can easily sink the stream to S3 directly. Debezium can capture deletes and append a special __deleted flag column to your data, so your Glue ETL can remove the deleted records using this field.
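For example, once Debezium's ExtractNewRecordState transform (with delete handling set to rewrite) has flattened the events and they have landed in S3, the Glue side could look roughly like the sketch below; the paths and key column are hypothetical, and the exact flag name depends on your Debezium version and configuration.

```python
# Rough sketch of the Glue/PySpark side: split a Debezium-flattened CDC feed
# landed on S3 into upserts and deletes using the __deleted flag.
# Paths, the key column, and the exact flag name depend on your Debezium setup.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

cdc = spark.read.json("s3://my-bucket/debezium/orders/")  # hypothetical landing path

# In rewrite mode Debezium emits the flag as the string "true"/"false";
# coalesce covers records that lack the field entirely.
flag = F.coalesce(F.col("__deleted"), F.lit("false"))
deletes = cdc.filter(flag == "true").select("id")
upserts = cdc.filter(flag != "true").drop("__deleted")

# From here the ETL can apply `deletes` as a delete list and `upserts` as a
# merge/overwrite against the lake table.
```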
I'm currently using Apache NiFi to ingest real-time data via Kafka, and I would like to take this data to Redshift to have a near real-time transaction table for online analytics of campaign results and other stuff.
The catch: if I use Kinesis or COPY from S3, I would have a LOT of reads/writes from/to S3, and I have found in previous experience that this becomes very expensive.
So, is there a way to send data to a Redshift table without constantly locking the destination table? The idea is to put the data in directly from NiFi without persisting it. I have an hourly batch process, so it would not be a problem if I lost a couple of rows from the online stream.
Why Redshift? That's our data lake platform, and it will come in handy to cross online data with other tables.
Any idea?
Thanks!
I have 3 MongoDB collections (from the same database) which act as input sources. Currently I'm using 3 Kinesis streams to access these collections. I need to analyze them by combining them. Can I use Kinesis Analytics to do this? I can't see an option to select multiple streams as inputs for a Kinesis Analytics app.
Kinesis Analytics does not yet support more than one streaming source and one reference data source per application.
See: http://docs.aws.amazon.com/kinesisanalytics/latest/dev/limits.html
You can use Drools Kinesis Analytics, which supports multiple input streams.
We want to write events from Event Hub to Data Lake Store using C#, without Stream Analytics.
We are able to write to Blob storage, but how can we write to Data Lake?
Other than Azure Stream Analytics, there are a few other options you could use. We have listed them at https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-data-scenarios - search for "Streamed data".
Do these meet your requirements?
Thanks,
Sachin Sheth
Program Manager,
Azure Data Lake.
You can try connecting the Event Hub-triggered function to Blob storage as an output binding and then orchestrating a movement of the data from Blob to Data Lake.
This way you can save some effort and process the data in batches as well.
Other options, such as having the Azure Function write to Data Lake directly, involve complex operations such as appending events to the same file until a threshold is reached and then flushing to Data Lake, which might not be ideal for a real-time environment.
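To sketch the binding layout described above (shown here in Python for brevity; the same bindings work from C# via attributes), with hypothetical hub, container, and connection names:

```python
# Sketch of an Event Hub-triggered Azure Function that copies each message into
# a blob via an output binding. The function.json for this function would
# declare two bindings (roughly):
#   - type "eventHubTrigger", name "event", direction "in",
#     eventHubName "myhub", connection "EventHubConnection"
#   - type "blob", name "outputBlob", direction "out",
#     path "staging/{rand-guid}.json", connection "BlobConnection"
# Hub, container, and connection setting names are hypothetical.
import azure.functions as func


def main(event: func.EventHubEvent, outputBlob: func.Out[bytes]) -> None:
    # Write the raw event payload to a new blob in the staging container;
    # a separate process can then move batches from Blob to Data Lake.
    outputBlob.set(event.get_body())
```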
My team is attempting to use Redshift to consolidate information from several different databases. In our first attempt to implement this solution, we used Kinesis Firehose to write records of POSTs to our APIs to S3, then issued a COPY command to load the inserted data into the correct tables in Redshift. However, this only allowed us to insert new data and did not let us transform data, update rows when altered, or delete rows.
What is the best way to maintain an updated data warehouse in Redshift without using batch transformation? Ideally, we would like updates to occur "automatically" (< 5min) whenever data is altered in our local databases.
Neither Firehose nor Redshift has triggers, but you could potentially use the approach of combining Lambda and Firehose to pre-process the data before it gets inserted, as described here: https://blogs.aws.amazon.com/bigdata/post/Tx2MUQB5PRWU36K/Persist-Streaming-Data-to-Amazon-S3-using-Amazon-Kinesis-Firehose-and-AWS-Lambda
In your case, you could extend it to trigger Lambda from S3 as Firehose creates new files, and have that Lambda execute the COPY and the SQL updates.
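As a rough sketch of that extension, an S3-triggered Lambda could issue the COPY for each new Firehose file. This version uses the Redshift Data API (a later addition than the linked post) purely to avoid bundling a Postgres driver, and the cluster, schema, table, and IAM role names are hypothetical.

```python
# Sketch of an S3-triggered Lambda that loads each new Firehose file into
# Redshift with COPY via the Redshift Data API. Cluster, database, table,
# and role names are hypothetical.
from urllib.parse import unquote_plus

import boto3

redshift_data = boto3.client("redshift-data")


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])

        copy_sql = f"""
            COPY staging.api_posts
            FROM 's3://{bucket}/{key}'
            IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
            FORMAT AS JSON 'auto';
        """

        # The COPY runs asynchronously; follow-up updates/merges can be chained
        # after it completes.
        redshift_data.execute_statement(
            ClusterIdentifier="analytics-cluster",
            Database="warehouse",
            DbUser="loader",
            Sql=copy_sql,
        )
```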
Another alternative is just writing your own KCL client that implements what Firehose does, and then executing the required updates after the COPY of each micro-batch (500-1000 rows).
I've done such an implementation (we needed to update old records based on new records) and it works all right from a consistency point of view, though I'd advise against such an architecture in general because of Redshift's poor update performance. In my experience, the key rule is that Redshift data is append-only, and it is often faster to use filters to exclude unwanted rows (with optional regular pruning, say daily) than to delete/update those rows in real time.
Yet another alternative is to have Firehose dump data into staging table(s), and then have scheduled jobs take whatever is in those tables, do the processing, move the data, and rotate the tables.
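A sketch of what that scheduled merge could look like, with hypothetical table and column names (run it via whatever driver or scheduler you already use); in practice you would rotate/rename the staging table first so rows arriving mid-merge aren't lost, as described above.

```python
# Sketch of the scheduled staging merge: Firehose COPYs land in
# staging.orders_changes, and this batch moves them into the target table.
# Table and column names are hypothetical; it assumes at most one row per id
# per batch (dedup upstream or with ROW_NUMBER() first).
MERGE_BATCH = """
BEGIN;

-- Drop the old versions of any rows that changed or were deleted upstream.
DELETE FROM analytics.orders
USING staging.orders_changes s
WHERE analytics.orders.id = s.id;

-- Re-insert the latest version of every row that still exists.
INSERT INTO analytics.orders (id, status, amount, updated_at)
SELECT id, status, amount, updated_at
FROM staging.orders_changes
WHERE op <> 'D';

-- Empty the staging table for the next micro-batch (DELETE rather than
-- TRUNCATE so it stays inside the transaction).
DELETE FROM staging.orders_changes;

COMMIT;
"""
```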
As a general reference architecture for real-time inserts into Redshift, take a look at this: https://blogs.aws.amazon.com/bigdata/post/Tx2ANLN1PGELDJU/Best-Practices-for-Micro-Batch-Loading-on-Amazon-Redshift
This has been implemented multiple times, and works well.