Database changes on AWS: real-time sync to GCP

We have PostgreSQL on AWS. All real-time changes from the Portal UI are captured in this database. However, there is a request to move these changes in real time, or near real time, to GCP.
Purpose: we want various consumers to ingest data from GCP instead of from the master data source in AWS Postgres.
When a new customer record is inserted into the customer table (in AWS Postgres), I want that record to be published immediately, in JSON format, to a GCP Pub/Sub topic.
Can anyone point me to a reference for moving table-specific data across clouds whenever a DML event occurs?
Please note that I am new to GCP and am still learning and exploring :)
Thanks

Databases use log shipping to update replicas. In your case, you want to update two targets (a database and Cloud Pub/Sub) by having the database perform the Pub/Sub update. That might be possible, but it will require development work.
PostgreSQL has no native ability to update Pub/Sub. Instead, change your requirements so that the application/service that updates the database also updates Pub/Sub (see the sketch after the links below).
If you really want PostgreSQL to do this task, you will need to use PostgreSQL triggers and write a trigger function in C that calls the Google Cloud Pub/Sub REST API.
PostgreSQL Trigger Example
PostgreSQL Event Trigger Example
Event triggers for PostgreSQL on Amazon RDS
Cloud Pub/Sub API
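For illustration, here is a minimal sketch of the application-level approach in Python: the service that inserts the customer row also publishes the same record as JSON to a Pub/Sub topic. The project ID, topic name, table columns, and connection details are placeholders, not details from the question.

```python
import json

import psycopg2
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Placeholder project and topic names.
topic_path = publisher.topic_path("my-gcp-project", "customer-changes")

def create_customer(conn, customer):
    """Insert the customer into AWS Postgres, then publish it as JSON to Pub/Sub."""
    with conn, conn.cursor() as cur:  # the transaction commits when this block exits
        cur.execute(
            "INSERT INTO customer (name, email) VALUES (%s, %s) RETURNING id",
            (customer["name"], customer["email"]),
        )
        customer["id"] = cur.fetchone()[0]

    # Publish after the commit; the message body is the JSON record.
    future = publisher.publish(topic_path, data=json.dumps(customer).encode("utf-8"))
    future.result()  # block until Pub/Sub acknowledges the message

conn = psycopg2.connect(host="my-aws-postgres-host", dbname="portal", user="app", password="...")
create_customer(conn, {"name": "Ada", "email": "ada@example.com"})
```

The main caveat is that this is a dual write: if the process dies between the commit and the publish, the two targets can drift, so a real implementation may want an outbox table or retry logic.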

Related

Share information with users when BigQuery data is up to date

I'm looking for the best place to store the information that data in a BigQuery table is ready for export and that the table is up to date and ready for users' queries. This information should be accessible to business users and to external applications (the check will be performed e.g. every 5 minutes).
I'm going to use Cloud Composer as the data workflow orchestration service, but Composer metadata in Cloud SQL is accessible only to the user who created the Composer instance.
What are the best practices for sharing such data with users?
This is more of a functional requirement. So why not add a new record to a data store at the end of each integration, and then make that data accessible to business users? Alternatively, you can use a store like Cloud Firestore, and when you add or modify a record you can trigger a Cloud Function that sends an email.
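As a rough sketch of the first option, assuming a Firestore collection named dataset_status (a hypothetical name): the final task of the Composer DAG records when a table became ready, and users or applications poll that document instead of Composer's metadata database.

```python
import datetime

from google.cloud import firestore

def mark_table_ready(table_name: str) -> None:
    """Record that `table_name` is up to date and ready for user queries."""
    db = firestore.Client()
    db.collection("dataset_status").document(table_name).set(
        {
            "table": table_name,
            "ready": True,
            "updated_at": datetime.datetime.utcnow().isoformat(),
        }
    )

# Called as the last step of the integration, e.g. from a PythonOperator in the DAG.
mark_table_ready("sales_daily")
```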

Transfer/Replicate Data periodically from AWS Documentdb to Google Cloud Big Query

We are building a customer-facing app. For this app, data is captured by IoT devices owned by a third party and is transferred to us from their server via API calls. We store this data in our AWS DocumentDB cluster. The user app is connected to this cluster and has real-time data feed requirements. Note: the data is time-series data.
The thing is, for long-term data storage and for creating analytics dashboards to be shared with stakeholders, our data governance folks are asking us to replicate/copy the data daily from the AWS DocumentDB cluster to their Google Cloud Platform -> BigQuery. We could then run queries directly on BigQuery to perform analysis and send the data to something like Explorer or Tableau to create dashboards.
I couldn't find any straightforward solution for this. Any ideas, comments, or suggestions are welcome. How do I achieve or plan the above replication, and how do I make sure the data is copied efficiently in terms of memory and pricing? I also don't want to disturb the performance of AWS DocumentDB, since it supports our user-facing app.
This solution will need some custom implementation. You can use Change Streams and process the data changes in intervals, sending them to BigQuery, so that you have a data replication mechanism in place for running analytics. One of the documented use cases for Change Streams is analytics with Redshift, and BigQuery can serve a similar purpose.
Using Change Streams with Amazon DocumentDB:
https://docs.aws.amazon.com/documentdb/latest/developerguide/change_streams.html
That document also contains sample Python code for consuming change stream events.
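As a rough sketch of that custom implementation, the loop below tails a DocumentDB change stream with pymongo and forwards inserts to BigQuery via streaming inserts. The connection string, database, collection, and table names are placeholders, and a production version would batch rows, persist the resume token, and run on a schedule rather than forever.

```python
from google.cloud import bigquery
from pymongo import MongoClient

# Placeholder DocumentDB connection string (change streams must be enabled on the cluster).
mongo = MongoClient("mongodb://user:pass@my-docdb-cluster:27017/?tls=true&replicaSet=rs0")
collection = mongo["iot"]["readings"]

bq = bigquery.Client()
table_id = "my-gcp-project.analytics.iot_readings"  # placeholder BigQuery table

with collection.watch(full_document="updateLookup") as stream:
    for change in stream:
        if change["operationType"] == "insert":
            doc = change["fullDocument"]
            doc["_id"] = str(doc["_id"])  # BigQuery has no ObjectId type
            errors = bq.insert_rows_json(table_id, [doc])
            if errors:
                print("BigQuery insert errors:", errors)
```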

AWS tech stack solution for a static website

I have a project where I am building a simple single-page app that needs to pull data from an API only once a day. I am thinking of building the backend with Go, and I need to do 2 things:
1) Have a scheduled job that updates the DB with the new data once a day.
2) Serve that data to the frontend. Since the data is only updated once a day, I would like to cache it after each update.
Since the number of options AWS offers is a bit overwhelming, I am wondering what the ideal solution for this scenario would be. Should I use a Lambda that connects to the DB and updates it with a scheduled job? Should I then create a separate REST API Lambda that pulls the data from the DB and is called from the frontend?
I would really appreciate suggestions for this problem.
Here is my suggestion:
Create a Lambda function that fetches the required information from the database.
You may use S3 or DynamoDB to save your content; both may be free depending on your usage, so check the free tier offers.
Have the Lambda save the fetched content to S3 or DynamoDB (you may check DAX for DynamoDB caching).
Create an API Gateway and integrate it with your Lambda (an Elastic Load Balancer is another choice).
Create a schedule expression in CloudWatch to trigger the Lambda daily.
Make a request from your frontend to API Gateway or the ELB.
You may use Route 53 for domain naming.
Your Lambda should have two separate functions: one responds to the schedule expression, the other serves your content by communicating with S3/DynamoDB (see the sketch below).
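As a minimal sketch of those two functions in Python (the question mentions Go, but the structure is the same), using S3 as the cache; the bucket name, object key, and upstream API URL are placeholders:

```python
import urllib.request

import boto3

s3 = boto3.client("s3")
BUCKET = "my-daily-data-bucket"  # placeholder bucket
KEY = "latest.json"

def refresh_handler(event, context):
    """Triggered once a day by the CloudWatch/EventBridge schedule expression."""
    with urllib.request.urlopen("https://api.example.com/daily-data") as resp:
        payload = resp.read()
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=payload, ContentType="application/json")
    return {"statusCode": 200}

def serve_handler(event, context):
    """Triggered by API Gateway; returns the cached content from S3."""
    obj = s3.get_object(Bucket=BUCKET, Key=KEY)
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": obj["Body"].read().decode("utf-8"),
    }
```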
Edit:
If the content is going to be static, you may configure an S3 bucket for static site hosting and have your daily Lambda write to it when triggered. Then you no longer need API Gateway or DynamoDB.
Here is the documentation for S3 static content.
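If you go that route, a one-time setup sketch might look like the following; the bucket name is a placeholder, and granting public read access via a bucket policy is omitted:

```python
import boto3

s3 = boto3.client("s3")

# Enable static website hosting on the bucket (one-time setup).
s3.put_bucket_website(
    Bucket="my-static-site-bucket",
    WebsiteConfiguration={
        "IndexDocument": {"Suffix": "index.html"},
        "ErrorDocument": {"Key": "error.html"},
    },
)

# The scheduled Lambda from the sketch above would then simply overwrite the data file:
# s3.put_object(Bucket="my-static-site-bucket", Key="data.json", Body=payload)
```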

Trigger Cloud Function on database insert?

Not sure how to search for this: I'm looking for a way to trigger a Cloud Function whenever a new row is inserted into a database in Cloud SQL. Searching for "google cloud function events" (or "triggers") turns up Firebase results, which is not what I want.
There is a series of Cloud Functions that receive data and transform it according to the clients' needs; in the end, after some manipulation, that data ends up in a table. Is there an event I can listen to so I can access the newly inserted rows? If not, I might end up using Cloud Scheduler and peeking into the DB regularly. However, that solution doesn't seem viable for the long term.
I'd appreciate any advice.
Currently there is no official Cloud Function event which could be triggered on changes to a Cloud SQL database. You can check the available events in the Events and Triggers documentation.
You could still do something like it with Cloud Pub/Sub, in one of two ways:
1 - The first would be to enable and export logs from the Cloud SQL instance to a Pub/Sub topic by creating a sink in Stackdriver, and have the Cloud Function listen to that topic.
Although this method does not require you to change the way you insert data into the DB, it might expose too much information, as all queries will be logged in Stackdriver. It also means you would not have full control over what information is passed to the function, since the message would be the contents of the log entry.
2 - The ideal solution would be to create the Pub/Sub topic yourself and publish to it when you insert new data into the database. This way you have more control over the information sent to the topic. You can find more information about how to set up a new topic in the Cloud Pub/Sub documentation.
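As a rough sketch of option 2 in Python, with the project and topic names as placeholders: whatever service writes the row to Cloud SQL also publishes it, and a Pub/Sub-triggered Cloud Function picks it up.

```python
import base64
import json

from google.cloud import pubsub_v1

# --- publisher side: run by the service that inserts the row into Cloud SQL ---
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "new-rows")  # placeholder names

def publish_row(row: dict) -> None:
    """Publish the newly inserted row as a JSON message."""
    publisher.publish(topic_path, data=json.dumps(row).encode("utf-8")).result()

# --- Cloud Function side: deployed with a Pub/Sub trigger on the same topic ---
def handle_new_row(event, context):
    """Background Cloud Function; event["data"] is the base64-encoded message body."""
    row = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    print("New row received:", row)
```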

Rate limited API requests in Cloud Composer

I'm planning a project whereby I'd be hitting the (rate-limited) Reddit API and storing data in GCS and BigQuery. Initially, Cloud Functions would be the choice, but I'd have to create a Datastore implementation to manage the "pseudo" queue of requests and GAE for cron jobs.
Doing everything in Dataflow wouldn't make sense, because making external requests (i.e. hitting the Reddit API) and running a single job perpetually are not advised.
Could I use Cloud Composer to read fields from a Google Sheet, then create a queue of requests based on the Google Sheet, then have a task queue execute those requests, store them in GCS and load into BigQuery?
That sounds like a legitimate use case for Composer. Additionally, you could leverage the pool concept in Airflow to manage concurrent calls to the same endpoint (e.g., the Reddit API); a sketch is below.
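As a rough sketch of the pool idea, assuming a pool named reddit_api has been created in the Airflow UI or CLI with a small number of slots (the DAG name, subreddits, and schedule are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def fetch_subreddit(subreddit, **_):
    # Call the rate-limited Reddit API here and write the result to GCS.
    pass

with DAG(
    "reddit_to_bq",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    for sub in ["python", "dataengineering", "googlecloud"]:
        PythonOperator(
            task_id=f"fetch_{sub}",
            python_callable=fetch_subreddit,
            op_kwargs={"subreddit": sub},
            pool="reddit_api",  # only as many of these run at once as the pool has slots
        )
```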