I'm planning a project in which I'd be hitting the (rate-limited) Reddit API and storing data in GCS and BigQuery. My initial choice would be Cloud Functions, but I'd have to build a Datastore implementation to manage the "pseudo" queue of requests, plus GAE for cron jobs.
Doing everything in Dataflow wouldn't make sense, because making external requests (i.e. hitting the Reddit API) and running a single job perpetually are both discouraged.
Could I use Cloud Composer to read fields from a Google Sheet, build a queue of requests from those fields, have a task queue execute the requests, store the results in GCS, and load them into BigQuery?
Sounds like a legitimate use case for Composer. Additionally, you could leverage Airflow's pool concept to limit concurrent calls to the same endpoint (e.g., the Reddit API).
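Roughly, a minimal sketch of such a DAG, assuming a pool named reddit_api already exists (Admin -> Pools) and with the subreddit list hard-coded instead of read from the Sheet; the fetch logic is a placeholder:

```python
# Minimal Airflow DAG sketch. Assumes a pool named "reddit_api" was created
# beforehand with a slot count matching the rate limit you want to respect.
# Subreddit list and fetch logic are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def fetch_subreddit(subreddit: str, **_):
    # Placeholder: call the Reddit API for one subreddit, write raw JSON to GCS.
    print(f"Fetching r/{subreddit}")


with DAG(
    dag_id="reddit_ingest",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    # In practice this list would come from the Google Sheet (e.g. via the
    # Google Sheets API in a separate task or at DAG-parse time).
    for sub in ["python", "dataengineering"]:
        PythonOperator(
            task_id=f"fetch_{sub}",
            python_callable=fetch_subreddit,
            op_kwargs={"subreddit": sub},
            pool="reddit_api",  # Airflow caps concurrent tasks holding slots here
        )
```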
I heavily use Google Cloud Run, for many reasons; one of them is the simplicity of treating each request as stateless and handling it individually.
However, I was thinking recently that for a service of ours which simply writes data to a DB, it would be very handy to batch a few requests rather than write each one individually. Is this possible on serverless platforms, specifically Cloud Run?
Because Cloud Run is stateless, you can't hold on to the requests (i.e. keep them, which would be stateful) and process them later. You need an intermediary layer for that.
One good way, which I have already implemented, is to publish the requests to Pub/Sub (either directly, or by using a Cloud Run service or Cloud Function to receive each request and turn it into a Pub/Sub message).
Then you can create a Cloud Scheduler job that triggers a Cloud Run service. That Cloud Run service pulls the Pub/Sub subscription and reads a bunch of messages (maybe all of them). You then have all the "requests" as a batch and can process them inside the request triggered by Cloud Scheduler (don't forget that you can't process in the background with Cloud Run; you must be in a request context -> for now ;) ).
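A minimal sketch of that draining service, assuming a pull subscription named db-writes-sub, a PROJECT_ID env var set on the service, and a hypothetical write_rows_in_one_batch() bulk insert; Cloud Scheduler would POST to /drain on whatever cadence you want the batches:

```python
# Sketch of the Cloud Run service Cloud Scheduler triggers: pull up to 100
# messages from the Pub/Sub subscription, write them to the DB in one batch,
# then ack. Subscription name and the bulk-insert helper are assumptions.
import os

from flask import Flask
from google.cloud import pubsub_v1

app = Flask(__name__)
subscriber = pubsub_v1.SubscriberClient()
SUBSCRIPTION = subscriber.subscription_path(
    os.environ["PROJECT_ID"], "db-writes-sub"  # PROJECT_ID set as an env var
)


def write_rows_in_one_batch(rows):
    # Placeholder for your DB client's bulk insert (single INSERT, COPY, etc.).
    pass


@app.route("/drain", methods=["POST"])
def drain():
    response = subscriber.pull(
        request={"subscription": SUBSCRIPTION, "max_messages": 100}
    )
    received = response.received_messages
    if received:
        write_rows_in_one_batch([m.message.data.decode("utf-8") for m in received])
        subscriber.acknowledge(
            request={
                "subscription": SUBSCRIPTION,
                "ack_ids": [m.ack_id for m in received],
            }
        )
    return f"processed {len(received)} messages", 200
```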
These blogs might be worth a look; I've done some reading and it looks like you can pull some good ideas from them:
Running a serverless batch workload on GCP with Cloud Scheduler, Cloud Functions, and Compute Engine
Batching Jobs in GCP Using the Cloud Scheduler and Functions
Here is another Stack Overflow thread that shows a similar approach.
We are building a customer-facing app. For this app, data is captured by IoT devices owned by a third party and transferred to us from their server via API calls. We store this data in our Amazon DocumentDB cluster. The user app is connected to this cluster with real-time data feed requirements. Note: the data is time series data.
The thing is, for long-term data storage and for creating analytics dashboards to be shared with stakeholders, our data governance folks are asking us to replicate/copy the data daily from the Amazon DocumentDB cluster to their Google Cloud Platform project, into BigQuery. Then we can run queries directly on BigQuery to perform analysis and send the data to maybe Explorer or Tableau to create dashboards.
I couldn't find any straightforward solutions for this. Any ideas, comments or suggestions are welcome. How do I achieve or plan the above replication, and how do I make sure the data is copied efficiently in terms of memory and pricing? Also, I don't want to disturb the performance of Amazon DocumentDB, since it backs our user-facing app.
This solution will need some custom implementation. You can utilize Change Streams and process the data changes at intervals to send to BigQuery, which gives you a data replication mechanism to run analytics on. One of the documented use cases for Change Streams is analytics with Redshift, so BigQuery should serve a similar purpose (see the sketch below).
Using Change Streams with Amazon DocumentDB:
https://docs.aws.amazon.com/documentdb/latest/developerguide/change_streams.html
That document also contains sample Python code for consuming change stream events.
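If it helps, here is a rough Python sketch of that pattern: tail the change stream with pymongo and forward inserts to BigQuery in small batches. The connection string, table ID, and the to_bq_row() conversion helper are placeholders, and change streams must already be enabled on the collection:

```python
# Rough sketch: consume a DocumentDB change stream with pymongo and push
# inserted documents to BigQuery in batches. Endpoint, credentials, table ID,
# and the row-conversion helper are placeholders.
from google.cloud import bigquery
from pymongo import MongoClient

mongo = MongoClient("mongodb://user:password@docdb-cluster-endpoint:27017/?tls=true")
collection = mongo["iot"]["readings"]

bq = bigquery.Client()
TABLE_ID = "my-project.analytics.readings"


def to_bq_row(document: dict) -> dict:
    # Placeholder: convert ObjectIds/datetimes and reshape the document so it
    # matches the BigQuery table schema.
    return {**document, "_id": str(document["_id"])}


batch = []
with collection.watch(full_document="updateLookup") as stream:
    for change in stream:
        if change["operationType"] == "insert":
            batch.append(to_bq_row(change["fullDocument"]))
        if len(batch) >= 500:
            errors = bq.insert_rows_json(TABLE_ID, batch)
            if errors:
                raise RuntimeError(f"BigQuery insert failed: {errors}")
            batch.clear()
```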
We have PostgreSQL on AWS. All real-time changes from the portal UI are captured in this database. However, there is a request to move these changes, in real time or near real time, to GCP.
Purpose: we want various consumers to ingest data from GCP instead of from the master data source in PostgreSQL on AWS.
When a new customer record is inserted into the customer table (in AWS PostgreSQL), I want to immediately publish that record, in JSON format, to a GCP Pub/Sub topic.
Can you point me to any reference on moving table-specific data across clouds as and when a DML event occurs?
Please note that I am new to GCP and still learning and exploring :)
Thanks
Databases use log shipping to update replicas. In your case, you want to update two targets (the database and Cloud Pub/Sub) by having the database perform the Pub/Sub update. That might be possible, but it will require development work.
PostgreSQL does not have a native ability to update Pub/Sub. Instead, change your requirements so that the application/service that updates the database also updates Pub/Sub.
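For illustration, a sketch of that dual write, assuming psycopg2 on the application side and a Pub/Sub topic named customer-events; the project ID, connection string, and columns are made up:

```python
# Sketch of the application-level dual write: insert the customer row into
# PostgreSQL, then publish the same record as JSON to Pub/Sub. Project ID,
# topic, connection string, and columns are assumptions.
import json

import psycopg2
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "customer-events")


def create_customer(name: str, email: str) -> None:
    record = {"name": name, "email": email}
    conn = psycopg2.connect("host=aws-postgres-host dbname=portal user=app")
    try:
        with conn:  # commits the transaction on successful exit
            with conn.cursor() as cur:
                cur.execute(
                    "INSERT INTO customer (name, email) VALUES (%s, %s) RETURNING id",
                    (name, email),
                )
                record["id"] = cur.fetchone()[0]
    finally:
        conn.close()
    # Publish only after the commit so consumers never see uncommitted data.
    publisher.publish(topic_path, json.dumps(record).encode("utf-8")).result()
```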
If you really want PostgreSQL to do this task, you will need to use PostgreSQL triggers and write a trigger function in C that calls the Google Cloud Pub/Sub REST API.
PostgreSQL Trigger Example
PostgreSQL Event Trigger Example
Event triggers for PostgreSQL on Amazon RDS
Cloud Pub/Sub API
I'm implementing my first pipeline for "automated" data ingestion at my company. Our client doesn't want to let us make any calls against their database (not even to create a replica, etc.). The best solution I have thought of so far is an endpoint (letting them push the data to a storage location), so we can consume it and carry out the whole data science process. My cloud provider is Google Cloud and my client uses MySQL Server.
I have been reading many topics on the web and reached the following links:
Google Cloud Data Lifecycle - for batch processing, it talks a bit about Cloud Storage, the Storage Transfer Service, and Transfer Appliance
Signed URLs - these URLs are time-limited resources that give access to, for example, Google Cloud Storage and allow writing data into it.
My simple solution is to use Signed URLs -> Cloud Storage -> Dataflow -> BigQuery. Is this a good approach?
To sum up, I am looking for recommendations about best practices and possible ways to let the client insert data into GCP without exposing their data or my infrastructure.
Constraints:
Client will send data periodically (once a day ingestion)
Data is semi-structured (I will create an internal pipeline to make transformations)
After preprocessing, data must be sent to BigQuery
Signed URLs and Dataflow may not be necessary here. Signed URLs are generally used when you don't want users to need a Google account to access Cloud Storage, but they also come with extra considerations when dealing with resumable uploads. If you know your client will have a Google account when pushing the data, they can be skipped (especially since the timeouts that protect private keys are not necessary here, because the code runs in the backend rather than in, say, a client's mobile app). You could simply create a basic web app with App Engine which the client uses to perform the daily push; it would upload the file to the Cloud Storage bucket via a resumable upload. App Engine would also make sure the files are in a proper format and follow specific constraints you define before uploading them.
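A very small sketch of what that App Engine front end could look like (Flask, with an assumed bucket name and a trivial format check standing in for your validation rules):

```python
# Sketch of an App Engine handler the client hits for the daily push: check the
# file format, then upload it to the ingestion bucket (the client library uses
# a resumable upload for larger files). Bucket name and checks are assumptions.
from flask import Flask, jsonify, request
from google.cloud import storage

app = Flask(__name__)
bucket = storage.Client().bucket("client-ingest-bucket")


@app.route("/daily-push", methods=["POST"])
def daily_push():
    upload = request.files.get("file")
    if upload is None or not upload.filename.endswith(".csv"):
        return jsonify(error="a .csv file is required"), 400
    blob = bucket.blob(f"daily/{upload.filename}")
    blob.upload_from_file(upload, content_type="text/csv")
    return jsonify(stored=blob.name), 200
```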
As for Dataflow, its best use is streaming, and in your case it's a passive batch ingestion; paying for a service that is constantly running when you need the transform to happen only once a day may not be the best approach. It would be more efficient to use Cloud Functions to pre-process and apply the transforms, triggered by an object change notification on the Cloud Storage bucket. The function would then push the data to BigQuery using its API.
The complete flow would be:
App Engine web app sanitizes the dump -> Storage API -> Bucket Object Change Notification -> Trigger Cloud Function (CF) -> CF downloads object -> CF performs the transform -> CF saves rows to BigQuery
GAE -> GCS -> CF -> BQ
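A rough sketch of the CF step in that flow, assuming newline-delimited JSON dumps, a placeholder destination table, and a hypothetical transform_record() standing in for your transform:

```python
# Sketch of the Cloud Function in the flow above: triggered by
# google.storage.object.finalize, it downloads the new object, applies the
# transform, and streams the rows into BigQuery. The table ID, the assumption
# of newline-delimited JSON, and transform_record() are placeholders.
import json

from google.cloud import bigquery, storage

bq = bigquery.Client()
gcs = storage.Client()
TABLE_ID = "my-project.ingest.client_data"


def transform_record(raw: dict) -> dict:
    # Placeholder for the internal transformation pipeline mentioned above.
    return raw


def gcs_to_bq(event, context):
    """Background Cloud Function entry point (Cloud Storage trigger)."""
    blob = gcs.bucket(event["bucket"]).blob(event["name"])
    lines = blob.download_as_text().splitlines()
    rows = [transform_record(json.loads(line)) for line in lines if line.strip()]
    errors = bq.insert_rows_json(TABLE_ID, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```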
In my opinion, gsutil can do the job of pushing the data to Cloud Storage periodically.
Generally, gsutil works well for transferring files that are not too big.
I would personally write a cron job that runs a gsutil cp command to push the file from the on-prem system to the Cloud Storage bucket.
Reading from MySQL and writing to the file can be done with a simple Spring Boot job.
MySQL -> (write to file) -> file -> gsutil cp -> Cloud Storage
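If you'd rather keep it in one script instead of wiring a Spring Boot job and gsutil together, a minimal Python sketch of the same flow could look like this (the Cloud Storage client library's upload stands in for gsutil cp; host, credentials, query, and bucket name are placeholders), run daily from cron:

```python
# Sketch of the same flow in one script: query MySQL, write a CSV, upload it to
# Cloud Storage (in place of gsutil cp). Connection details, query, and bucket
# name are placeholders; schedule it with cron for the daily push.
import csv
import datetime

import pymysql
from google.cloud import storage


def export_and_upload():
    today = datetime.date.today().isoformat()
    local_path = f"/tmp/export_{today}.csv"

    conn = pymysql.connect(host="mysql-host", user="etl", password="secret", database="app")
    try:
        with conn.cursor() as cur, open(local_path, "w", newline="") as f:
            cur.execute("SELECT * FROM readings WHERE DATE(created_at) = %s", (today,))
            writer = csv.writer(f)
            writer.writerow([col[0] for col in cur.description])  # header row
            writer.writerows(cur.fetchall())
    finally:
        conn.close()

    bucket = storage.Client().bucket("client-ingest-bucket")
    bucket.blob(f"daily/export_{today}.csv").upload_from_filename(local_path)


if __name__ == "__main__":
    export_and_upload()
```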
I'm new to BigQuery and need to run some tests on it. Looking through the BigQuery documentation, I can't find anything about creating jobs and scheduling them.
I found on another page on the internet that the only available method is to create a bucket in Google Cloud Storage, create a function in Cloud Functions using JavaScript, and write the SQL query inside its body.
Can someone help me here? Is it true?
Your question is a bit confusing, as you mix scheduling jobs with defining a query in a Cloud Function.
There is a difference between scheduling jobs and scheduling queries.
BigQuery offers Scheduled queries. See docs here.
BigQuery Data Transfer Service (schedules recurring data loads from GCS). See docs here.
If you want to schedule jobs (load, delete, copy jobs, etc.), you are better off doing it with a trigger on the observed resource: a new Cloud Storage file, a Pub/Sub message, or an HTTP trigger, all wired into a Cloud Function.
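For the scheduled-queries option above, a minimal sketch of creating one programmatically with the BigQuery Data Transfer client; project, dataset, query, and schedule are placeholders, and the same configuration can be created from the BigQuery console UI instead:

```python
# Sketch: create a BigQuery scheduled query via the Data Transfer Service
# client. Project ID, dataset, query, and schedule are placeholders.
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()
parent = client.common_project_path("my-project")

QUERY = (
    "SELECT CURRENT_DATE() AS run_date, COUNT(*) AS n "
    "FROM `my-project.raw.events`"
)

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="reporting",
    display_name="daily_rollup",
    data_source_id="scheduled_query",
    params={
        "query": QUERY,
        "destination_table_name_template": "rollup_{run_date}",
        "write_disposition": "WRITE_TRUNCATE",
    },
    schedule="every 24 hours",
)

transfer_config = client.create_transfer_config(
    parent=parent, transfer_config=transfer_config
)
print(f"Created scheduled query: {transfer_config.name}")
```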
Some other related blog posts:
How to schedule a BigQuery ETL job with Dataprep
Scheduling BigQuery Jobs: This time using Cloud Storage & Cloud Functions