How to build a tenant-level metrics counter for GCP Dataflow jobs? - google-cloud-platform

Currently I am trying to create custom metrics for a GCP Dataflow job using Apache Beam Metrics, and I wanted to check whether we can track/group counters by tenant. For instance, we have events generated by multiple tenants, and all of the events are processed in one Dataflow job (which writes to Bigtable). I want to add a metrics filter to group them by tenant so we could see elementsAdded at the tenant level.
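For reference, a minimal sketch of the kind of counter involved, assuming the Beam Python SDK; the DoFn name, namespace, and counter name are illustrative, not from the original pipeline:

import apache_beam as beam
from apache_beam.metrics import Metrics


class WriteToBigtableFn(beam.DoFn):
    def __init__(self):
        # A single global counter; Beam offers no built-in way to attach
        # a per-tenant label/dimension to it.
        self.elements_added = Metrics.counter('tenant_events', 'elementsAdded')

    def process(self, event):
        self.elements_added.inc()
        yield event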

This is not currently possible with Beam Metrics; they don't have the ability to set extra metric fields such as tenant.
However, you can use the Cloud Monitoring API directly from your pipeline code to export data into Cloud Monitoring with any schema you'd like.
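For example, here is a hedged sketch, assuming the google-cloud-monitoring Python client and an illustrative custom metric type custom.googleapis.com/dataflow/elements_added, that pushes a per-tenant count from a DoFn. In practice you would usually aggregate counts per tenant per window first rather than call the API per element:

import time

import apache_beam as beam
from google.cloud import monitoring_v3


class ReportTenantCountFn(beam.DoFn):
    """Writes (tenant, count) pairs as a custom Cloud Monitoring time series."""

    def __init__(self, project_id):
        self.project_id = project_id

    def setup(self):
        self.client = monitoring_v3.MetricServiceClient()

    def process(self, tenant_and_count):
        tenant, count = tenant_and_count  # e.g. output of a windowed Count.PerKey()

        series = monitoring_v3.TimeSeries()
        series.metric.type = "custom.googleapis.com/dataflow/elements_added"
        series.metric.labels["tenant"] = tenant  # the extra dimension Beam Metrics cannot express
        series.resource.type = "global"

        now = time.time()
        seconds = int(now)
        nanos = int((now - seconds) * 10**9)
        interval = monitoring_v3.TimeInterval(
            {"end_time": {"seconds": seconds, "nanos": nanos}}
        )
        point = monitoring_v3.Point(
            {"interval": interval, "value": {"int64_value": count}}
        )
        series.points = [point]

        self.client.create_time_series(
            name=f"projects/{self.project_id}", time_series=[series]
        )
        yield tenant_and_count

With data written this way, you can group and filter on the tenant label in Metrics Explorer or in dashboards.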

Related

GCP batch Data Pipeline

I'm working on a GCP data pipeline. I'm using Dataflow and Dataproc, and the destination is BigQuery. I have created multiple data pipelines in AWS, and this is my first time in GCP.
In AWS I used Deequ with Glue for data validation after landing files in the staging area. So my question is: is there a similar Deequ-like service in GCP that I can use for data validation, or will Dataproc, Dataflow, or some other service handle it on its own?
Thanks

GCP Dataflow job to transfer data from Pub/Sub (in one project) to BigQuery in another, implemented with Terraform, doesn't read messages

I implemented a Dataflow job with Terraform, using the Google-provided Pub/Sub to BigQuery template. Pub/Sub is in one project, while Dataflow and BigQuery are in the other. The Dataflow job is created, Compute Engine scales, subscriptions get created, and the service account has all possible permissions to run the Dataflow job, plus Pub/Sub and Service Account User permissions in the project where Pub/Sub lives. The Pipeline API is enabled. The Dataflow job has status Running, the BigQuery tables are created, and the table schemas match the message schema. The only problem is that Dataflow doesn't read messages from Pub/Sub. One more clue, maybe: when I open Pipelines (within Dataflow) I see nothing, and the temp location specified in the Terraform code is not created. The service account has Cloud Storage Admin permissions, so that's another indication that the Dataflow job (pipeline) just doesn't initiate the stream. Any suggestions? Maybe somebody has had a similar issue?

Is it Possible to Build a REST API interface on top of the Spark Cluster?

Essentially, we are running a batch ML model using a Spark EMR cluster on AWS. There will be several iterations of the model, so we want to have some sort of model metadata endpoint on top of the Spark cluster. In this way, other services that rely on the output of the EMR cluster can ping the Spark cluster's REST API endpoint and be informed of the latest ML system version it's using. I'm not sure if this is feasible or not.
Objective:
We want other services to be able to ping the EMR cluster that runs the latest ML model and obtain the metadata for the model, which includes the ML system version.
If I have understood correctly, you want to add metadata (e.g., version, last updated, action performed, etc.) somewhere once the Spark job is finished, right?
There can be several possibilities, and all will be integrated into your data pipeline in the same way as any other task, for example, triggering the Spark job with a workflow management tool (Airflow/Luigi), a Lambda function, or even cron.
Updating metadata after the Spark job runs
So for the post-Spark-job step, you can add something to your pipeline that writes this metadata to some DB or event store. I am sharing two options; you can decide which one is more feasible:
Utilize a CloudWatch event and associate a Lambda function with it. Amazon EMR automatically sends events to a CloudWatch event stream.
Add a step in your workflow management tool (Airflow/Luigi) that triggers a DB/event-store update step/operator "on completion" of the EMR step, for example, using EmrStepSensor in Airflow to gate the follow-up step that writes to the DB (see the sketch below).
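As a sketch of the second option, assuming Airflow with the Amazon provider installed; the DAG name, cluster/step IDs, and metadata values are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor


def write_model_metadata(**context):
    # Placeholder: persist whatever metadata the REST service will later serve,
    # e.g. into DynamoDB or an RDS table.
    metadata = {
        "ml_system_version": "1.4.0",
        "finished_at": context["ts"],
        "action": "batch_scoring",
    }
    print(f"writing metadata: {metadata}")  # replace with a real DB write


with DAG(
    dag_id="emr_model_metadata",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Wait for the EMR Spark step to finish; in a real DAG the IDs usually come
    # from upstream EmrCreateJobFlowOperator / EmrAddStepsOperator tasks via XCom.
    wait_for_spark_step = EmrStepSensor(
        task_id="wait_for_spark_step",
        job_flow_id="j-PLACEHOLDER",
        step_id="s-PLACEHOLDER",
        aws_conn_id="aws_default",
    )

    update_metadata = PythonOperator(
        task_id="update_metadata",
        python_callable=write_model_metadata,
    )

    wait_for_spark_step >> update_metadata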
For a REST API on top of the DB/event store
Now, once you have a regular update mechanism in place for every EMR Spark step run, you can build a normal REST API on EC2 or a serverless API using AWS Lambda. You will essentially be returning this metadata from the REST service; a minimal sketch follows.
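For instance, here is a sketch of the serverless variant, assuming API Gateway in front of AWS Lambda and a hypothetical DynamoDB table ml_model_metadata keyed on model_name:

import json

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("ml_model_metadata")  # hypothetical table name


def lambda_handler(event, context):
    # Return the metadata record written after the last EMR Spark step.
    item = table.get_item(Key={"model_name": "batch_scoring_model"}).get("Item")
    if item is None:
        return {"statusCode": 404, "body": json.dumps({"error": "no metadata yet"})}
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(item, default=str),
    }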

Database changes on AWS real time sync to GCP

We have PostgreSQL on AWS. All realtime changes from the Portal UI are captured in this database. However, there is a request to move these changes, in real time or near real time, to GCP.
Purpose: We want various consumers to ingest data from GCP instead of from the master data source in Postgres on AWS.
When a new customer record is inserted into the customer table (in AWS Postgres), I want to immediately publish that record, in JSON format, to a GCP Pub/Sub topic.
Please point me to any reference for moving table-specific data across clouds as and when a DML event occurs.
Please note that I am new to GCP and still learning and exploring :)
Thanks
Databases use log shipping to update slaves/replicas. In your case, you want to update two targets (database, Cloud Pub/Sub) by having the database do the Pub/Sub update. That might be possible but will require development work.
PostgreSQL does not have a native ability to update Pub/Sub. Instead, change your requirements so that the application/service that is updating the database then updates Pub/Sub.
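As a sketch of that application-side approach, assuming the google-cloud-pubsub client library; the project, topic, table, and column names are placeholders:

import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "customer-changes")  # placeholders


def insert_customer(cursor, customer):
    # 1. Write to the PostgreSQL master on AWS exactly as today.
    cursor.execute(
        "INSERT INTO customer (id, name, email) VALUES (%s, %s, %s)",
        (customer["id"], customer["name"], customer["email"]),
    )
    # 2. Publish the same record as JSON to the GCP Pub/Sub topic.
    future = publisher.publish(
        topic_path, data=json.dumps(customer).encode("utf-8")
    )
    future.result()  # block until Pub/Sub acknowledges the message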
If you really want PostgreSQL to do this task, you will need to use PostgreSQL triggers and write a trigger function in C with the Google Cloud Pub/Sub REST API.
PostgreSQL Trigger Example
PostgreSQL Event Trigger Example
Event triggers for PostgreSQL on Amazon RDS
Cloud Pub/Sub API

Monitoring AWS Data Pipeline

In our infrastructure we have a bunch of pipelines for ETL-ing data before pushing it into Redshift. We use an S3 bucket for logs and SNS alerting for activities. Most of these activities are standard CopyActivity, RedshiftCopyActivity, and SqlActivity.
We want to collect all available metrics for these activities and dashboard them (e.g., in CloudWatch) so we can see what's going on in one single place. Unfortunately, I didn't find much information in the AWS documentation about this, and it seems we'd have to do it all manually in code.
What is the most common way for monitoring AWS Data Pipeline?