I'm working on a GCP data pipeline using Dataflow and Dataproc, with BigQuery as the destination. I have built multiple data pipelines in AWS, but this is my first time in GCP.
In AWS I used Deequ with Glue for data validation after landing files in the staging area. My question is: is there a similar Deequ-like service in GCP that I can use for data validation, or will Dataproc, Dataflow, or some other service handle it on its own?
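For reference, this is the kind of Deequ check I ran on Glue. Since Dataproc is managed Spark, I assume a PyDeequ version of it, roughly like the sketch below, could run there too (the staging path and column names are made up, and it assumes PyDeequ and the matching Deequ jar are installed on the cluster):

```python
from pyspark.sql import SparkSession
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

# Assumes PyDeequ and the matching Deequ jar are already on the cluster.
spark = SparkSession.builder.appName("staging-validation").getOrCreate()

# Hypothetical staging path and columns.
df = spark.read.parquet("gs://my-bucket/staging/orders/")

check = (Check(spark, CheckLevel.Error, "staging checks")
         .isComplete("order_id")     # no NULLs
         .isUnique("order_id")       # no duplicate keys
         .isNonNegative("amount"))   # basic range check

result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(check)
          .run())

VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)
```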
Thanks
Related
I have multiple ERPs ingesting data into S3, and I use AWS Glue for Spark processing.
I found out that I need Delta-format files for Spark processing, and that the best way to run this ETL is on EMR or Databricks.
Should I go with Databricks for incremental loads and full-load refreshes of the dashboard?
Or can EMR also manage a full data refresh along with the update-matched / insert-new (merge) behaviour? If yes, please share some info.
What I am confused about is: if I only have new/updated/deleted data to process, how will the dashboard show me all the previous data?
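To make the update-matched / insert-new part concrete, this is roughly the Delta merge I have in mind (PySpark; the paths and the key column are made up). As far as I understand, the merge only rewrites the affected files, while the Delta table itself still holds every row loaded so far, which is what the dashboard would query:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Assumes the Delta Lake package is available on the cluster
# (both EMR and Databricks support it).
spark = (SparkSession.builder
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Hypothetical paths: the incremental extract from the ERPs and the full Delta table.
updates = spark.read.parquet("s3://my-bucket/incoming/customers/")
target = DeltaTable.forPath(spark, "s3://my-bucket/delta/customers/")

# Upsert: update rows that already exist, insert rows that are new.
(target.alias("t")
 .merge(updates.alias("s"), "t.customer_id = s.customer_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```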
I am implementing a Dataflow job with Terraform, using the Google-provided Pub/Sub to BigQuery template. Pub/Sub is in one project, while Dataflow and BigQuery are in another. The Dataflow job is created, Compute Engine scales, the subscription gets created, and the service account has every permission needed to run the Dataflow job, plus Pub/Sub and Service Account User permissions in the project where Pub/Sub lives. The Pipelines API is enabled. The Dataflow job is in the Running status, the BigQuery tables are created, and the table schemas match the message schema. The only problem is that Dataflow doesn't read messages from Pub/Sub.

One more observation: when I open Pipelines (within Dataflow) I see nothing, and the temp location specified in the Terraform code is never created, even though the service account has Cloud Storage Admin permissions. That is another indication that the Dataflow job (pipeline) just doesn't initiate the stream. Any suggestions? Has anybody had a similar issue?
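To narrow it down, I was thinking of running a stripped-down, hand-written version of what the template does against the same cross-project subscription, to check whether the subscription delivers anything at all. A rough Beam sketch (the project, subscription, and table names are placeholders):

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder names: the subscription lives in the Pub/Sub project,
# the table in the project that runs Dataflow and BigQuery.
SUBSCRIPTION = "projects/pubsub-project/subscriptions/my-subscription"
TABLE = "dataflow-project:my_dataset.my_table"

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | "ReadFromPubSub" >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)
     | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
           TABLE,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
```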
We have PostgreSQL on AWS. All real-time changes from the Portal UI are captured in this database. However, there is a request to move these changes in real time, or near real time, to GCP.
Purpose: we want various consumers to ingest data from GCP instead of from the master data source in AWS Postgres.
When a new customer record is inserted into the customer table (in AWS Postgres), I want to immediately publish that record in JSON format to a GCP Pub/Sub topic.
Could you point me to any reference for moving table-specific data across clouds as and when a DML event occurs?
Please note that I am new to GCP and still learning and exploring :)
Thanks
Databases use log shipping to update slaves/replicas. In your case, you want to update two targets (database, Cloud Pub/Sub) by having the database do the Pub/Sub update. That might be possible but will require development work.
PostgreSQL does not have a native ability to publish to Pub/Sub. Instead, change your requirements so that the application/service that updates the database also publishes to Pub/Sub.
If you really want PostgreSQL to do this task, you will need to use PostgreSQL triggers and write a trigger function in C with the Google Cloud Pub/Sub REST API.
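As a rough sketch of the application-level approach (the connection details, topic, and column names below are all made up), the service that inserts the customer row can publish the same record right after the commit:

```python
import json
import os

import psycopg2
from google.cloud import pubsub_v1

# Hypothetical project, topic, and connection details.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "customer-changes")

conn = psycopg2.connect(
    host="my-aws-postgres-host",
    dbname="portal",
    user="app",
    password=os.environ["PGPASSWORD"],
)

def create_customer(customer):
    """Insert the customer in Postgres, then publish the same record to Pub/Sub."""
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO customer (name, email) VALUES (%s, %s) RETURNING id",
            (customer["name"], customer["email"]),
        )
        customer["id"] = cur.fetchone()[0]

    # Pub/Sub message data must be bytes; send the inserted row as JSON.
    future = publisher.publish(topic_path, json.dumps(customer).encode("utf-8"))
    future.result()  # wait for Pub/Sub to acknowledge the publish

create_customer({"name": "Ada", "email": "ada@example.com"})
```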
PostgreSQL Trigger Example
PostgreSQL Event Trigger Example
Event triggers for PostgreSQL on Amazon RDS
Cloud Pub/Sub API
My organization is evaluating options for a hybrid data warehouse using AWS Redshift and S3. The objective is to process the data on-premises, send a processed copy to S3, and then load it into Redshift for visualization.
As we are in the initial stages, there is no file/storage gateway set up yet.
Initially we used the Informatica Cloud tool to upload data from the on-premises server to AWS S3, but it was taking a long time. The data volume is a few hundred million records of history and a few thousand records in the daily incremental load.
Now I have created custom UNIX scripts that use the AWS CLI cp command to transfer gzip-compressed files between the on-premises server and AWS S3.
This option is working fine.
But I would like to understand from experts whether this is the right way of doing it, or whether there are other, more optimized approaches available.
If the size of your data is more than 100 MB, AWS suggests using multipart upload for better performance.
You can refer to the link below to see how to benefit from this:
AWS Java SDK to upload large file in S3
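The linked example uses the Java SDK; if your UNIX scripts can call out to Python, the same idea with boto3 looks roughly like the sketch below (bucket, key, and thresholds are only illustrative). boto3 splits the file into parts and uploads them in parallel once it crosses the threshold:

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Files larger than multipart_threshold are uploaded as multipart uploads,
# with the parts transferred in parallel.
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,  # switch to multipart above ~100 MB
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=8,
)

# Hypothetical local path, bucket, and key.
s3.upload_file(
    Filename="/data/export/history_2020.csv.gz",
    Bucket="my-dw-staging-bucket",
    Key="history/history_2020.csv.gz",
    Config=config,
)
```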
In our infrastructure we have a number of pipelines that ETL data before pushing it into Redshift. We use an S3 bucket for logs and SNS alerting for activities. Most of those activities are standard CopyActivity, RedshiftCopyActivity, and SqlActivity.
We want to get all available metrics for these activities and dashboard them (e.g. in CloudWatch) so we can see what's going on there in one place. Unfortunately I didn't find much information in the AWS documentation about this, and it seems I have to do it all manually in code.
What is the most common way of monitoring AWS Data Pipeline?
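For context, this is roughly the kind of polling script I've ended up writing by hand: it asks each pipeline for its object instances and pushes a custom CloudWatch metric (the namespace and metric name are made up, and only the first batch of instances is checked for brevity):

```python
import boto3

dp = boto3.client("datapipeline")
cw = boto3.client("cloudwatch")

for pipeline in dp.list_pipelines()["pipelineIdList"]:
    # Object instances carry the runtime @status field (FINISHED, FAILED, ...).
    ids = dp.query_objects(pipelineId=pipeline["id"], sphere="INSTANCE")["ids"]
    if not ids:
        continue
    objects = dp.describe_objects(
        pipelineId=pipeline["id"], objectIds=ids[:25]
    )["pipelineObjects"]

    failed = sum(
        1
        for obj in objects
        for field in obj["fields"]
        if field["key"] == "@status" and field.get("stringValue") == "FAILED"
    )

    # Hypothetical custom namespace/metric so everything lands on one dashboard.
    cw.put_metric_data(
        Namespace="DataPipeline/Custom",
        MetricData=[{
            "MetricName": "FailedActivityInstances",
            "Dimensions": [{"Name": "PipelineName", "Value": pipeline["name"]}],
            "Value": failed,
            "Unit": "Count",
        }],
    )
```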