Partitioning data in eventhub using capture - azure-eventhub

Is it possible to capture Event Hub data in ADLS Gen2 in a Hive-style folder partitioning format like date=xyz using the Capture feature in the portal?

Partitioning is a feature of the ADLS store and is abstracted away from Event Hubs Capture, so I think you should be able to configure partitions just fine.
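In case it helps: the capture file name format can also be arranged into a Hive-style layout. As far as I know the portal only offers a few preset layouts, but via an ARM template or the management SDK the archiveNameFormat appears to accept free-form text around the required tokens ({Namespace}, {EventHub}, {PartitionId}, {Year}, {Month}, {Day}, {Hour}, {Minute}, {Second}), so a literal date= segment can be included. A minimal sketch, assuming the azure-mgmt-eventhub Python SDK; every name and resource ID below is a placeholder:

```python
# Sketch only: enable Capture on an Event Hub with a Hive-style
# "date=YYYY-MM-DD" folder in the capture path. Assumes azure-identity and
# azure-mgmt-eventhub are installed; all names/IDs are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.eventhub import EventHubManagementClient
from azure.mgmt.eventhub.models import CaptureDescription, Destination, Eventhub

client = EventHubManagementClient(DefaultAzureCredential(), "<subscription-id>")

# All built-in tokens must still appear somewhere in the format; literal
# text such as "date=" can be interleaved with them.
archive_name_format = (
    "{Namespace}/{EventHub}/date={Year}-{Month}-{Day}/"
    "{PartitionId}_{Hour}{Minute}{Second}"
)

client.event_hubs.create_or_update(
    resource_group_name="<resource-group>",
    namespace_name="<eventhub-namespace>",
    event_hub_name="<eventhub-name>",
    parameters=Eventhub(
        partition_count=4,
        message_retention_in_days=1,
        capture_description=CaptureDescription(
            enabled=True,
            encoding="Avro",                 # Capture writes Avro files
            interval_in_seconds=300,
            size_limit_in_bytes=314572800,
            destination=Destination(
                name="EventHubArchive.AzureBlockBlob",  # ADLS Gen2 sits on Blob
                storage_account_resource_id="<storage-account-resource-id>",
                blob_container="capture",
                archive_name_format=archive_name_format,
            ),
        ),
    ),
)
```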

Related

AWS Glue catalogPartitionPredicate: to_date is not working

I am planning to utilize catalogPartitionPredicate in one of my projects, and I am unable to handle one of the scenarios. The details are below:
Partition columns: Year, Month & Day
catalogPartitionPredicate: year>='2021' and month>='12'
If the year changes to 2022 (e.g. 2022-01-01) and I still want to read data from 2021-12-01 onward, the expression can no longer handle it, since it will not allow 2022 data to be read. I tried concatenating the partition keys, but that didn't work.
Is there any way to implement to_date functionality, or any other workaround, to handle this scenario?
Could you add your Glue code?
Did you try running a Glue crawler?
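Not a direct to_date replacement, but one workaround that may help is to rewrite the date comparison as explicit boolean logic over the string partition columns, so the predicate can still be pushed down to the catalog. A minimal sketch, assuming a Glue job reading from the Data Catalog; the database and table names are placeholders:

```python
# Sketch: server-side partition pruning without to_date, by expanding the
# "everything since 2021-12-01" condition into boolean logic over the string
# partition columns. Database/table names are placeholders.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# December 2021 onwards, plus every later year.
partition_predicate = "(year = '2021' AND month >= '12') OR (year > '2021')"

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_table",
    additional_options={"catalogPartitionPredicate": partition_predicate},
)
```

String comparisons like these are only safe if the partition values are zero-padded (e.g. month 01..12), so it is worth checking how the partitions were written.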

How should we configure Datastream in Google Cloud

We have roughly 60 tables whose changes we would like to capture (CDC) using GCP's Datastream and ingest into our data lake.
Are there any drawbacks to using Datastream? And should I set up one stream that ingests all the tables, or should I create a stream per group of tables (or per single table), so that if a stream fails the impact is localized to a few specific tables?
Thanks in advance.

How to capture data changes in AWS Glue?

We have source data in an on-premises SQL Server. We are using AWS Glue to fetch data from SQL Server and land it in S3. Could anyone please explain how we can implement change data capture in AWS Glue?
Note: we don't want to use AWS DMS.
You can leverage AWS DMS for CDC and then use the Apache Iceberg connector with the Glue Data Catalog to achieve this:
https://aws.amazon.com/blogs/big-data/implement-a-cdc-based-upsert-in-a-data-lake-using-apache-iceberg-and-aws-glue/
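The gist of that post is to land the CDC records (with an operation flag) in a staging table or view and then MERGE them into an Iceberg table from the Glue job. A rough sketch of what the merge might look like, assuming a Spark session already configured for Iceberg with a catalog named glue_catalog; the database, table, column and flag names are made up:

```python
# Sketch: upsert a staged CDC batch into an Iceberg target table. Assumes an
# Iceberg-enabled Glue/Spark session with a "glue_catalog" catalog already
# configured; db/table/column names and the "op" flag are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    MERGE INTO glue_catalog.analytics.customers AS t
    USING cdc_staging AS s          -- temp view holding the latest CDC batch
    ON t.customer_id = s.customer_id
    WHEN MATCHED AND s.op = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED AND s.op <> 'D' THEN INSERT *
""")
```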
I'm only aware of Glue Bookmarks. They will help you with the new records (Inserts), but won't help you with the Updates and Deletes that you typically get with a true CDC solution.
Not sure of your use case, but you could check out the following project. It has a pretty efficient diff feature and, with the right options, can give you a CDC-like output
https://github.com/G-Research/spark-extension/blob/master/DIFF.md
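For reference, a minimal sketch of what that diff looks like from a Glue/PySpark job, assuming the pyspark-extension package is available on the job and both snapshots share an id key; the paths and column names are placeholders:

```python
# Sketch: derive a CDC-like changeset by diffing yesterday's extract against
# today's. Assumes the pyspark-extension package (gresearch.spark.diff) is
# available; paths and the "id" key column are placeholders.
from pyspark.sql import SparkSession
from gresearch.spark.diff import *  # adds DataFrame.diff()

spark = SparkSession.builder.getOrCreate()

previous = spark.read.parquet("s3://my-bucket/extracts/previous/")
current = spark.read.parquet("s3://my-bucket/extracts/current/")

# One row per key with a "diff" column: N (no change), C (changed),
# I (inserted into current) or D (deleted from previous).
changes = previous.diff(current, "id")
changes.where("diff != 'N'").show()
```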
It's not possible to implement change data capture through direct Glue data extraction. While a job bookmark can help you identify inserts and updates if your table contains an update_at timestamp column, it won't cover deletes. You actually need a CDC solution.
While AWS Glue's direct connection to a database source is a great solution, I strongly discourage using it for incremental data extraction because of the cost implications. It's like using a truck to ship one bottle of water.
As you already commented, I am also not a fan of AWS DMS, but for a robust CDC solution a tool like Debezium could be a perfect fit. It integrates with Kafka and Kinesis, and you can easily sink the stream to S3 directly. Debezium can capture deletes and append a special boolean __deleted column to your data, so your Glue ETL can manage the removal of these deleted records using this field.
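To illustrate that last point, here is a small sketch of how a Glue ETL step might apply the flag once the Debezium stream has been sunk to S3. The exact flag name and values depend on how the connector's unwrap transform is configured (with delete.handling.mode=rewrite it is typically a __deleted field holding "true"/"false"); the paths and columns are placeholders:

```python
# Sketch: split a Debezium-produced batch into upserts and deletes using the
# flag added by the unwrap transform (assumed here to be a "__deleted" field
# with "true"/"false" values). Paths and columns are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

batch = spark.read.json("s3://my-bucket/debezium/orders/")

deletes = batch.where(F.col("__deleted") == "true").select("order_id")
upserts = batch.where((F.col("__deleted").isNull()) | (F.col("__deleted") == "false"))

# The deletes can then be anti-joined out of the curated dataset and the
# upserts merged/unioned in, however the target table is maintained.
```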

Real-time ingestion to Redshift without using S3?

I'm currently using Apache NiFi to ingest real-time data via Kafka, and I would like to take this data to Redshift to have a near-real-time transaction table for online analytics of campaign results and other stuff.
The catch: if I use Kinesis or COPY from S3, I would have a LOT of reads/writes from/to S3, and I have found from previous experience that this becomes very expensive.
So, is there a way to send data to a Redshift table without constantly locking the destination table? The idea is to put the data in directly from NiFi without persisting it. I have an hourly batch process, so it would not be a problem if I lost a couple of rows on the online stream.
Why Redshift? That's our data lake platform, and it will come in handy for crossing online data with other tables.
Any idea?
Thanks!
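One way to avoid S3 completely is to micro-batch multi-row INSERTs straight into Redshift over its PostgreSQL-compatible endpoint, for example from NiFi's PutDatabaseRecord/PutSQL processors or from a small consumer. A rough sketch of the idea, assuming psycopg2 and a hypothetical campaign_events table; keep in mind that Redshift strongly favours COPY for large volumes, so this only makes sense at modest, batched rates:

```python
# Sketch: micro-batched multi-row INSERT into Redshift via the PostgreSQL
# protocol, avoiding S3 staging. Assumes psycopg2; the host, table and
# columns are hypothetical. Only suitable for modest, batched rates.
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="ingest", password="...",
)

def write_batch(rows):
    """rows: list of (event_time, campaign_id, metric, value) tuples."""
    with conn.cursor() as cur:
        execute_values(
            cur,
            "INSERT INTO campaign_events "
            "(event_time, campaign_id, metric, value) VALUES %s",
            rows,
        )
    conn.commit()
```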

Want to write events from Event Hub to Data Lake Store using C# without Stream Analytics

We want to write events from Event Hub to Data Lake Store using C# without Stream Analytics.
We are able to write to Blob storage, but how can we write to Data Lake?
Other than Azure Stream Analytics, there are a few other options you could use. We have listed them at https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-data-scenarios - search for "Streamed data".
Do these meet your requirements?
Thanks,
Sachin Sheth
Program Manager,
Azure Data Lake.
You can try connecting the Event Hub-triggered function to Blob storage as an output binding and then orchestrate the movement of data from Blob to Data Lake.
This way we can save some effort and process the data in batches as well.
Other options, such as having Azure Functions write directly to Data Lake, involve complex operations such as appending events to the same file until a threshold is reached and then flushing to the Data Lake, which might not be ideal for a real-time environment.
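To make the suggested pattern concrete, here is a sketch of an Event Hub-triggered function with a Blob output binding. The question asks about C#, where the same trigger and binding exist; Python is used here only to keep the example short, and all names and connection settings are placeholders:

```python
# Sketch: forward each Event Hub event to Blob storage via an output binding;
# a downstream job can then batch/move the blobs into Data Lake. Hub,
# container and connection-setting names are placeholders.
import azure.functions as func

app = func.FunctionApp()

@app.event_hub_message_trigger(
    arg_name="event", event_hub_name="my-hub", connection="EVENTHUB_CONNECTION"
)
@app.blob_output(
    arg_name="outblob",
    path="captured-events/{rand-guid}.json",
    connection="STORAGE_CONNECTION",
)
def capture_event(event: func.EventHubEvent, outblob: func.Out[str]):
    # Write the raw event body to a new blob.
    outblob.set(event.get_body().decode("utf-8"))
```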