I am trying to implement a BI solution on GCP where I have data in flat files in Cloud Datastore, and I have to push this data into my data warehouse on BigQuery. The data will be incremental after the first load.
There doesn't seem to be any ETL functionality that I can use to implement this incremental data load into my warehouse. Using Cloud Dataflow, I can push the delta load into the BigQuery tables, but this approach doesn't handle updated records correctly.
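To make the gap concrete, what I need on each incremental load is an upsert, roughly what a BigQuery MERGE from a staging table would do (a sketch with hypothetical dataset, table and column names):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical tables: `staging.events_delta` holds the incremental extract,
# `dwh.events` is the warehouse table keyed on `id`.
merge_sql = """
MERGE `dwh.events` AS target
USING `staging.events_delta` AS source
ON target.id = source.id
WHEN MATCHED THEN
  UPDATE SET value = source.value, updated_at = source.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, value, updated_at) VALUES (source.id, source.value, source.updated_at)
"""

client.query(merge_sql).result()  # run the upsert and wait for it to finish
```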
Can anyone suggest what would be the best approach for implementing this solution?
Related
I have multiple ERPs ingesting data into S3, and I have AWS Glue for Spark processing.
I found out that I need Delta-format files for Spark processing, and that the best way to run this ETL is on EMR or Databricks.
Should I go for Databricks for the incremental load and the full-load refresh of the dashboard?
Or can EMR also manage a full data refresh along with update-matched and insert-new (merge) behaviour? If yes, please share some info.
What I am confused about is: if I only have new/updated/deleted data to process, how will the dashboard show me all the previous data?
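To clarify what I mean by update-matched and insert-new, this is roughly the Delta Lake merge I have in mind (a sketch with hypothetical S3 paths and column names, assuming the delta-spark package is available on the cluster):

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Hypothetical locations: today's incremental extract and the target Delta table.
updates_df = spark.read.json("s3://erp-landing/orders/2024-01-01/")
target = DeltaTable.forPath(spark, "s3://lake/erp_orders")

(
    target.alias("t")
    .merge(updates_df.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()      # update rows that already exist
    .whenNotMatchedInsertAll()   # insert brand-new rows
    .execute()
)
```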
I have a use case where I need to sync Spanner tables with BigQuery tables. I need to update the Spanner tables based on the updated data in the BigQuery tables. I am planning to use Cloud Data Fusion for this, but I do not see any example available for this scenario. Any pointers on this?
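Since I cannot find a Data Fusion example, what I effectively want to achieve is something like this hand-rolled sync (a sketch with hypothetical instance, database, table and column names, using the BigQuery and Spanner Python clients rather than Data Fusion):

```python
from google.cloud import bigquery, spanner

bq = bigquery.Client()
database = spanner.Client().instance("my-instance").database("my-db")

# Pull rows that changed since the last sync (hypothetical watermark column).
rows = bq.query(
    "SELECT id, name, updated_at FROM `analytics.customers` "
    "WHERE updated_at > TIMESTAMP('2024-01-01 00:00:00')"
).result()

# insert_or_update gives upsert semantics on the Spanner side.
with database.batch() as batch:
    batch.insert_or_update(
        table="customers",
        columns=("id", "name", "updated_at"),
        values=[(r.id, r.name, r.updated_at) for r in rows],
    )
```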
I am able to discover BigQuery datasets and GCS files in Google Data Catalog, but I could not find Cloud SQL or Cloud Spanner options in the Cloud Data Catalog UI.
Is it possible to view Cloud SQL tables and Cloud Spanner tables in Data Catalog? If yes, please suggest steps or provide documentation links.
Thanks.
Yes, it is possible using Data Catalog custom entries.
To view Cloud SQL tables, you can use the open source connectors for MySQL, SQL Server and PostgreSQL.
Also check the on-premise ingestion use cases from the official docs.
Yes, it is possible.
Details:
Other than the natively supported metadata types (GCS, Pub/Sub and BigQuery), sources need to be handled via the Data Catalog APIs.
Ref: https://cloud.google.com/data-catalog/docs/how-to/custom-entries
That is, use one of the 7 supported client-library languages to programmatically loop through all the tables of a custom data source (e.g. Bigtable) and create entries and tags dynamically.
My favorites are Python and C#.
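For example, a minimal Python sketch along those lines (hypothetical project, location and table names, assuming the google-cloud-datacatalog client library):

```python
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

# Hypothetical project/location and a made-up list of Bigtable tables.
project, location = "my-project", "us-central1"
tables = ["orders", "customers"]

# One entry group to hold the custom entries.
entry_group = client.create_entry_group(
    parent=f"projects/{project}/locations/{location}",
    entry_group_id="bigtable_tables",
    entry_group=datacatalog_v1.EntryGroup(display_name="Bigtable tables"),
)

for table in tables:
    entry = datacatalog_v1.Entry(
        display_name=table,
        user_specified_system="bigtable",   # marks this as a custom source
        user_specified_type="table",
        linked_resource=f"//bigtable.googleapis.com/projects/{project}/tables/{table}",
    )
    client.create_entry(parent=entry_group.name, entry_id=table, entry=entry)
```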
Much appreciated if anyone else has a better alternative approach.
Unfortunately, there is no native integration between Data Catalog, Cloud SQL and Cloud Spanner. Nevertheless, there is an issue tracker reported for this feature.
As you can see in the shared link, as a workaround you can manually create a JDBC connector to Spanner and export the metadata to Data Catalog custom entries on a schedule, along the lines of Mahendren's suggestion. You can do something similar with Cloud SQL.
I need to ETL data into my Cloud SQL instance. This data comes from API calls. Currently, I'm running custom Java ETL code in Kubernetes with CronJobs that makes requests to collect this data and loads it into Cloud SQL. The problem is managing the ETL code and monitoring the ETL jobs. The current solution may not scale well when more ETL processes are incorporated. In this context, I need to use an ETL tool.
My Cloud SQL instance contains two types of tables: common transactional tables and tables that contain data coming from the API. The second type is mostly read-only from an "operational database" perspective, and a large part of these tables is bulk-updated every hour (in batch) to discard the old data and refresh the values.
Considering this context, I noticed that Cloud Dataflow is the ETL tool provided by GCP. However, it seems that this tool is more suitable for big data applications that need to do complex transformations and ingest data in multiple formats. Also, in Dataflow, the data is processed in parallel and worker nodes are scaled as needed. Since Dataflow is a distributed system, the ETL process might have an overhead when allocating resources just to do a simple bulk load. In addition, I noticed that Dataflow doesn't have a dedicated sink for Cloud SQL. This probably means that Dataflow isn't the right tool for simple bulk load operations into a Cloud SQL database.
For my current needs, I only need to do simple transformations and bulk load the data. However, in the future we might want to handle other sources of data (PNG, JSON, CSV files) and sinks (Cloud Storage and maybe BigQuery). Also, in the future, we might want to ingest streaming data and store it in Cloud SQL. In this sense, the underlying Apache Beam model is really interesting, since it offers a unified model for batch and streaming.
Giving all this context, I can see two approaches:
1) Use an ETL tool like Talend in the Cloud to help monitoring ETL jobs and maintenance.
2) Use Cloud Dataflow, since we may need streaming capabilities and integration with all kinds of sources and sinks.
The problem with the first approach is that I may end up using Cloud Dataflow anyway when future requirements arrive, and that would be bad for my project in terms of infrastructure costs, since I would be paying for two tools.
The problem with the second approach is that Dataflow doesn't seem to be suitable for simple bulk load operations into a Cloud SQL database.
Is there something I am getting wrong here? Can someone enlighten me?
You can use Cloud Dataflow just for loading operations. Here is a tutorial on how to perform ETL operations with Dataflow. It uses BigQuery but you can adapt it to connect to your Cloud SQL or other JDBC sources.
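As a rough illustration of the JDBC route, here is a minimal Beam pipeline in Python that reads a dump from GCS and writes to a hypothetical Postgres Cloud SQL instance through Beam's JDBC connector (a cross-language transform, so Java must be available at runtime; bucket, table and connection details are made up):

```python
import json
import typing

import apache_beam as beam
from apache_beam import coders
from apache_beam.io.jdbc import WriteToJdbc
from apache_beam.options.pipeline_options import PipelineOptions

# Schema-aware row type so the JDBC sink knows the table layout.
class ApiRecord(typing.NamedTuple):
    id: int
    value: str

coders.registry.register_coder(ApiRecord, coders.RowCoder)

def run():
    with beam.Pipeline(options=PipelineOptions()) as pipeline:
        (
            pipeline
            | "ReadDump" >> beam.io.ReadFromText("gs://my-bucket/api_dump/*.json")  # hypothetical path
            | "ToRows" >> beam.Map(
                lambda line: ApiRecord(**json.loads(line))
            ).with_output_types(ApiRecord)
            | "WriteToCloudSQL" >> WriteToJdbc(
                table_name="api_data",                       # hypothetical table
                driver_class_name="org.postgresql.Driver",
                jdbc_url="jdbc:postgresql://10.0.0.5:5432/appdb",
                username="etl_user",
                password="change-me",
            )
        )

if __name__ == "__main__":
    run()
```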
More examples can be found on the official Google Cloud Platform GitHub page for Dataflow analysis of user-generated content.
You can also have a look at this GCP ETL architecture example that automates the tasks of extracting data from operational databases.
For simpler ETL operations, Dataprep is an easy tool to use and provides flow scheduling as well.
I want to load a large amount of data into Google Cloud BigQuery.
What are the options at hand (using the UI and the APIs), and what would be the fastest way?
TIA!
You can load data:
From Google Cloud Storage
From other Google services, such as DoubleClick and Google AdWords
From a readable data source (such as your local machine)
By inserting individual records using streaming inserts (see the sketch after this list)
Using DML statements to perform bulk inserts
Using a Google Cloud Dataflow pipeline to write data to BigQuery
More options are covered in Introduction to Loading Data into BigQuery.
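For the streaming-inserts option above, a minimal sketch (hypothetical dataset and table, assuming the google-cloud-bigquery client):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical destination table; each call appends rows via the streaming API.
errors = client.insert_rows_json(
    "my_dataset.events",
    [
        {"id": 1, "name": "first"},
        {"id": 2, "name": "second"},
    ],
)
if errors:
    print(f"Streaming insert problems: {errors}")
```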
Loading data into BigQuery from Google Drive is not currently supported, but you can query data in Google Drive using an external table.
You can load data into a new table or partition, you can append data to an existing table or partition, or you can overwrite a table or partition. For more information on working with partitions, see Managing Partitioned Tables.
When you load data into BigQuery, you can supply the table or partition schema, or for supported data formats, you can use schema auto-detection.
Each method is fast; if your data is large, you should go with Google Cloud Storage.
When you load data from Google Cloud Storage into BigQuery, your data can be in any of the following formats:
Comma-separated values (CSV)
JSON (newline-delimited)
Avro
Parquet
ORC (Beta)
Google Cloud Datastore backups
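Putting the recommended Google Cloud Storage option together, a minimal load-job sketch (hypothetical bucket, dataset and table names, assuming the google-cloud-bigquery client):

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,                   # let BigQuery infer the schema
    write_disposition="WRITE_APPEND",  # append to the existing table
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/*.json",   # hypothetical source files
    "my_dataset.my_table",             # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # wait for the load job to complete
```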