Pros and Cons of Google Dataflow VS Cloud Run while pulling data from HTTP endpoint - google-cloud-platform

This is a design approach question where we are trying to pick the best option between Apache Beam / Google Dataflow and Cloud Run to pull data from HTTP endpoints (source) and send it downstream to Google BigQuery (sink).
Traditionally we have implemented similar functionalities using Google Dataflow where the sources are files in the Google Storage bucket or messages in Google PubSub, etc. In those cases, the data arrived in a 'push' fashion so it makes much more sense to use a streaming Dataflow job.
However, in the new requirement, since the data is fetched periodically from an HTTP endpoint, it sounds reasonable to use a Cloud Run service that spins up on a schedule.
So I want to gather pros and cons of going with either of these approaches, so that we can make a sensible design for this.
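To make the scheduled-pull option concrete, here is a minimal sketch of what a single scheduled run could do: fetch a JSON array from the endpoint and hand the rows to a sink in batches. The names `pull_once`, `fetch_json`, `batched`, and `write_rows` are hypothetical, and the sink is injected so the logic can run without GCP credentials. In a real deployment `write_rows` might wrap the BigQuery client's `insert_rows_json`, but that is an assumption about the setup, not something stated in the question.

```python
import json
import urllib.request
from typing import Callable, Iterable, List


def fetch_json(url: str, timeout: float = 30.0) -> List[dict]:
    """Fetch a JSON array of records from an HTTP endpoint (stdlib only)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def batched(rows: List[dict], size: int) -> Iterable[List[dict]]:
    """Yield successive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def pull_once(
    url: str,
    write_rows: Callable[[List[dict]], None],
    fetch: Callable[[str], List[dict]] = fetch_json,
    batch_size: int = 500,
) -> int:
    """Run one scheduled pull: fetch records, then write them in batches.

    Returns the number of rows handed to the sink.
    """
    rows = fetch(url)
    written = 0
    for batch in batched(rows, batch_size):
        write_rows(batch)  # hypothetical sink, e.g. a BigQuery insert wrapper
        written += len(batch)
    return written
```

Because the whole run is one short function call, it maps naturally onto something triggered on a schedule; a streaming Dataflow job would instead stay up between pulls.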

I am not sure this question is appropriate for SO, as it opens a big discussion with different opinions, without clear context, scope, functional and non-functional requirements, time and financial constraints including CAPEX/OPEX, who is going to support the solution in BAU after commissioning and how, etc.
In my personal experience, I have developed a few dozen similar pipelines using various combinations of Cloud Functions, Pub/Sub topics, Cloud Storage, Firestore (for pipeline process state management), and so on, sometimes with Dataflow embedded into the pipelines as well, but I have never used Cloud Run. My knowledge and experience may not be relevant in your case.
The only thing I might suggest is to prioritize your requirements (in the context of the whole solution lifecycle) and then design the solution based on those priorities. I know it is a trivial idea; sorry to disappoint you.

Related

Best way to ingest data to bigquery

I have heterogeneous sources such as flat files residing on-prem, JSON on SharePoint, an API that serves data, and so on. Which is the best ETL tool to bring data into a BigQuery environment?
I'm a kindergarten student in GCP :)
Thanks in advance
There are many solutions to achieve this. It depends on several factors some of which are:
frequency of data ingestion
whether or not the data needs to be manipulated before being written into bigquery (your files may not be formatted correctly)
is this going to be done manually or is this going to be automated
size of the data being written
If you are just looking for an ETL tool you can find many. If you plan to scale this to many pipelines you might want to look at a more advanced tool like Airflow but if you just have a few one-off processes you could set up a Cloud Function within GCP to accomplish this. You can schedule it (via cron), invoke it through HTTP endpoint, or pub/sub. You can see an example of how this is done here
After several attempts at data lake / data warehouse design and architecture, I can recommend only one thing: ingest your data into BigQuery as soon as possible, whatever the format or transformation.
Then, in BigQuery, run queries to format, clean, aggregate, and add value to your data. It's not ETL, it's ELT: you start by loading your data and then you transform it.
It's quicker, cheaper, simpler, and only based on SQL.
It works only if you use ONLY BigQuery as destination.
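The load-first, transform-later ordering can be illustrated with a small sketch. Note that `sqlite3` is used here only as a local stand-in for BigQuery so the example runs anywhere; the table names, columns, and sample rows are hypothetical. The point is that raw data lands untyped and all cleaning happens afterwards in SQL.

```python
import sqlite3

# Hypothetical messy input: stray whitespace, mixed case, a missing email.
RAW_ROWS = [
    (" 1", "alice@example.com ", "12.50"),
    ("2 ", "BOB@EXAMPLE.COM", "7"),
    ("3", "", "not-a-number"),
]


def load_then_transform(conn: sqlite3.Connection, raw_rows) -> list:
    # Step 1 (Load): land everything as untyped text, with no cleaning yet.
    conn.execute("CREATE TABLE raw_orders (id TEXT, email TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

    # Step 2 (Transform): clean and type the data with SQL, inside the database.
    conn.execute(
        """
        CREATE TABLE orders AS
        SELECT CAST(TRIM(id) AS INTEGER)  AS id,
               LOWER(TRIM(email))         AS email,
               CAST(TRIM(amount) AS REAL) AS amount
        FROM raw_orders
        WHERE TRIM(email) <> ''
        """
    )
    return conn.execute(
        "SELECT id, email, amount FROM orders ORDER BY id"
    ).fetchall()
```

The raw table is kept untouched, so the transformation can be corrected and re-run later without fetching the source data again.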
If you are starting from scratch and have no legacy tools to carry with you, the following GCP managed products target your use case:
Cloud Data Fusion, "a fully managed, code-free data integration service that helps users efficiently build and manage ETL/ELT data pipelines"
Cloud Composer, "a fully managed data workflow orchestration service that empowers you to author, schedule, and monitor pipelines"
Dataflow, "a fully managed streaming analytics service that minimizes latency, processing time, and cost through autoscaling and batch processing"
(Without considering a myriad of data integration tools and fully customized solutions using Cloud Run, Scheduler, Workflows, VMs, etc.)
Choosing one depends on your technical skills, real-time processing needs, and budget. As mentioned by Guillaume Blaquiere, if BigQuery is your only destination, you should try to leverage BigQuery's processing power on your data transformation.

Approach for syncing data from different CRM/MAP tools for different clients

We are building a highly efficient marketing automation tool.
What we want is to periodically sync data/modified data from our customer's CRM/MAP tools for salesforce, marketo, hubspot, etc.
While we have the REST APIs and all to do that, the biggest challenge we are facing is on architectural side. We are completely on AWS.
We have used an SQS-Lambda approach, but maintaining parallel syncing is a challenge. It is also difficult to make changes with this approach as the number of customers and their integrations with CRM/MAP tools increases.
We are now trying to use Airflow (MWAA) with Fargate, but we are facing some issues there as well.
Note:
Some syncs take a long time to process (hence, Lambda is not a feasible approach).
Please suggest some good approach to do this efficiently on a daily basis.

Can Cloud Workflows be used for both orchestration AND transformation?

I hope you guys are doing well.
We are evaluating some solutions (Apache Camel K and the likes) to allow teams to:
Low Code protocol transformation (Kafka, FTP, S3, MQ, SOAP, SFTP, gRPC, GraphQL, etc.) One team in particular has to integrate their product with 100s of external partners (each one uses a different integration technology), and writing each integration "by hand" would be a waste of time/motivation.
Enrich integrations' payloads (by calling both internal and external services)
Pay per execution/transformation/step (SERVERLESS)
Orchestrate processes that span multiple domains/services (on either our GCP account or partners' external datacenters)
Strong retry and monitoring capabilities
Be part of our CI/CD pipeline (and not be limited to a Graphical interface)
The items in bold seem to be part of what Cloud Workflows does natively, but are the other requirements something that can be added to (or achieved with) GCW to keep it "serverless"? Please.
Any help would be appreciated.
Thanks
Cloud Workflows can perform basic transformations (on strings or dates), but I can't recommend that. It's better to have a Cloud Function or a Cloud Run service perform the transformation in code. You will be able to write unit tests for it and ensure the quality and evolution of your system without regressions.
Orchestration is the purpose of Cloud Workflows. That said, it has some limits, and some corner cases are less easy to achieve with it. It depends on the complexity of your process and your expectations (observability, portability, replayability, ...).
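As a small sketch of why keeping the transformation in code helps: a pure function like the one below can be deployed behind a Cloud Function or Cloud Run endpoint and covered by an ordinary unit test. The payload fields (`partner`, `created`, `amount`) and the output shape are hypothetical, chosen only for illustration.

```python
import unittest
from datetime import datetime, timezone


def transform(record: dict) -> dict:
    """Normalize one hypothetical partner payload into an internal shape."""
    # Assumes the partner sends day/month/year timestamps that are already UTC.
    created = datetime.strptime(record["created"], "%d/%m/%Y %H:%M")
    created = created.replace(tzinfo=timezone.utc)
    return {
        "partner_id": record["partner"].strip().upper(),
        "created_at": created.isoformat(),
        "amount_cents": round(float(record["amount"]) * 100),
    }


class TransformTest(unittest.TestCase):
    def test_normalizes_fields(self):
        out = transform(
            {"partner": " acme ", "created": "31/01/2024 09:30", "amount": "12.34"}
        )
        self.assertEqual(out["partner_id"], "ACME")
        self.assertEqual(out["created_at"], "2024-01-31T09:30:00+00:00")
        self.assertEqual(out["amount_cents"], 1234)
```

The workflow then only has to call the service and route the result, which keeps the orchestration definition free of data-shaping logic.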

Can we use Google cloud function to convert xls file to csv

I am new to google cloud functions. My requirement is to trigger cloud function on receiving a gmail and convert the xls attachment from the email to csv.
Can we do this using GCP?
Thanks in advance !
Very shortly - that is possible as far as I know.
But.
You might find that in order to automate this task in a reliable, robust and self-healing way, it may be necessary to use half a dozen cloud functions, pubsub topics, maybe a cloud storage, maybe a firestore collection, security manager, customer service account with relevant IAM permissions, and so on. Maybe more than a dozen or two dozen different GCP resources. And, obviously, the code for those cloud functions has to be developed. Altogether, that may not be very easy or quick to implement.
At the same time, I personally saw (and contributed to the development of) a functional component, based on cloud functions, which together did exactly what you would like to achieve. And that was in production.

Google Cloud Spanner: Want Java API for doing my own retries

This is really a question for the Google Cloud Spanner Java API team...
Looking at the new Google Cloud Spanner service, it appears that the only way to perform read/write transactions is by providing a callback, via the TransactionRunner interface.
I understand that the API is trying to hide the details of the need to automatically retry transactions as a convenience to the programmer, but this limitation is a serious problem, at least for me. I need to be able to manage the transaction lifecycle myself, even if that means I have to perform my own retries (e.g., based on catching some sort of "retryable" exception).
To make this problem more concrete, suppose you wanted to implement Spring's PlatformTransactionManager for Google Cloud Spanner, so as to fit in with your existing code, and use your existing retry logic. It appears impossible to do that with the current Java API.
It seems like it would be easy to augment the API in a backward compatible way, to add a method returning a TransactionContext to the user, and let the user handle the retries.
Am I missing something? Can this alternate (more traditional) transaction API style be added to the Java API?
You are right that TransactionRunner is the only way to do read/write transactions in the Java client for Cloud Spanner. We believe that most users would prefer that to hand-rolling their own retry logic, but we realize it might not fit the needs of all users, and we would love to hear about such use cases. Could you please file a feature request so we can discuss it further there?
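For readers unfamiliar with the pattern under discussion, the sketch below shows a generic hand-rolled retry loop of the kind the question describes: catch a "retryable" exception and run the work again. It is written in Python purely as a language-agnostic illustration and does not use or reproduce the Cloud Spanner API; `RetryableError` and `run_with_retries` are hypothetical names.

```python
import time


class RetryableError(Exception):
    """Hypothetical marker for a transaction that was aborted and may be retried."""


def run_with_retries(work, max_attempts=5, base_delay=0.0, sleep=time.sleep):
    """Call `work(attempt)` until it succeeds or the attempts are exhausted.

    Only RetryableError triggers another attempt; any other exception
    propagates immediately. The delay doubles after each failed attempt.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return work(attempt)
        except RetryableError:
            if attempt == max_attempts:
                raise  # give up and surface the last failure to the caller
            sleep(base_delay * 2 ** (attempt - 1))
```

A callback-based runner hides essentially this loop from the caller, which is why the work has to be passed in as a function in that style.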