Could anyone provide some pointers on how to implement data lineage for a DW-type solution built on Google BigQuery, using Google Cloud Storage as the source and Google Cloud Composer as the workflow manager that runs a series of SQL statements?
If you have your data in Cloud Storage, you might want to use something like GoogleCloudStorageToBigQueryOperator to first load your data into BigQuery, then use BigQueryOperator to run your queries.
Then you can see how your different DAGs, tasks, etc. are running in the Airflow web UI inside Composer.
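For illustration, here is a minimal sketch of that pattern, assuming an Airflow 1.10-style Composer environment; every bucket, dataset, and table name below is a placeholder:

# Hypothetical DAG: load CSV files from GCS into BigQuery, then run a transform query.
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

with DAG(
    dag_id="gcs_to_bq_pipeline",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load = GoogleCloudStorageToBigQueryOperator(
        task_id="load_raw_data",
        bucket="my-source-bucket",                                    # placeholder
        source_objects=["raw/sales_*.csv"],                           # placeholder
        destination_project_dataset_table="my_project.staging.sales",
        source_format="CSV",
        skip_leading_rows=1,
        autodetect=True,                                              # infer the schema from the CSV
        write_disposition="WRITE_TRUNCATE",
    )
    transform = BigQueryOperator(
        task_id="build_report_table",
        sql="SELECT region, SUM(amount) AS total FROM `my_project.staging.sales` GROUP BY region",
        destination_dataset_table="my_project.reporting.sales_by_region",
        write_disposition="WRITE_TRUNCATE",
        use_legacy_sql=False,
    )
    load >> transform  # the task dependency doubles as a simple lineage record

The load >> transform ordering, visible as the DAG graph in the Airflow UI, also gives you a basic lineage record of which task loaded which table and which query consumed it.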
I need to design and display a Compute Engine snapshot report for different projects in the cloud in Data Studio. For this, I am trying to use the Google Compute Engine snapshots API below to retrieve the data.
https://compute.googleapis.com/compute/v1/projects/my-project/global/snapshots
The data may change every day depending on the snapshots created from the disks, so the report should always show the updated data.
Can this REST API be called directly from Google Data Studio?
Alternatively, what is the best/simplest way to display the response in Data Studio?
You can use a Community Connector in Data Studio to directly pull the data from the API.
Currently, there is no way to connect GCP Compute Engine (GCE) resource data or use the REST API directly in Data Studio. The only GCP products with built-in connectors are the following:
BigQuery
Cloud Spanner
Cloud SQL for MySQL
Google Cloud Storage
MySQL
PostgreSQL
A possible way to design and display a Compute Engine snapshot report for different projects in Data Studio is to create a Google Apps Script that calls the snapshots REST API and writes the results into a Google Sheet, and then import that sheet into Data Studio.
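If it helps to prototype the API call itself, here is a rough Python sketch of the same snapshots.list request (an Apps Script version would call the same endpoint via UrlFetchApp); the project ID and CSV output are placeholders:

# List Compute Engine snapshots for one project via the REST API and dump them to CSV.
# Authentication relies on Application Default Credentials.
import csv
from googleapiclient.discovery import build  # pip install google-api-python-client

compute = build("compute", "v1")
request = compute.snapshots().list(project="my-project")  # placeholder project ID

rows = []
while request is not None:
    response = request.execute()
    for snap in response.get("items", []):
        rows.append({
            "name": snap["name"],
            "sourceDisk": snap.get("sourceDisk", ""),
            "diskSizeGb": snap.get("diskSizeGb", ""),
            "creationTimestamp": snap["creationTimestamp"],
            "status": snap["status"],
        })
    request = compute.snapshots().list_next(request, response)

# Write the rows out; the same dictionaries could be appended to a Google Sheet instead.
with open("snapshots.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "sourceDisk", "diskSizeGb", "creationTimestamp", "status"])
    writer.writeheader()
    writer.writerows(rows)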
Additionally, if you have any questions regarding Data Studio, I would suggest reviewing the following resources:
Data Studio Help Center
Data Studio Help Community
EDIT: My apologies, it seems there is a way to show snapshots API response data in Data Studio: use a Community Connector to pull the data directly from the API.
I am totally new to the cloud. I started a few weeks ago with Azure, and we are setting up a project using many different Azure products.
At the moment we are thinking about setting the project up in a way that keeps us from being locked into Microsoft, so that we could switch to GCP or AWS. For most of the products we use I have found similar offerings in the other clouds, but is there something like Azure Data Factory in AWS or GCP? I could not find anything in my initial Google research.
Best and thanks for your help
If you need a good comparison between the different clouds (Azure, AWS, Google, Oracle, and Alibaba), use this site: http://comparecloud.in/
Example for your case with "Azure Data Factory":
You could use a mix of those products:
Cloud Data Fusion: https://cloud.google.com/data-fusion
Cloud Composer: https://cloud.google.com/composer
Cloud DataPrep: This is a version of Trifacta. Good for data cleaning.
If you need to orchestrate workflows/ETLs, Cloud Composer will do it for you. It is a managed Apache Airflow service, which means it can handle complex dependencies.
If you just need to trigger a job on a daily basis, Cloud Scheduler is your friend.
You can also check the link here, which is a cloud services mapping.
I have some questions related to Cloud Composer and BigQuery. We need to create an automated process to export tables from BigQuery to Cloud Storage.
I have 4 options at the moment:
bigquery_to_gcs Operator
BashOperator: Executing the "bq" command provided by the Cloud SDK on Cloud Composer.
Python Function: Create a Python function using the BigQuery API, almost the same as bigquery_to_gcs and execute this function with Airflow.
Dataflow: the job would be executed through Airflow too.
I have some concerns about the first three options, though. If the table is huge, is there a chance of consuming a significant part of Cloud Composer's resources? I have been trying to find out whether the BashOperator and the BigQuery operator consume Cloud Composer resources, keeping in mind that this process will eventually run in production alongside other DAGs running at the same time. If that is the case, would Dataflow be the more convenient option?
One advantage of Dataflow is that we can export the table to a single file if we want, which is not possible with the other options if the table is larger than 1 GB.
BigQuery itself has a feature to export data to GCS. This means that if you use any of the things you mentioned (except for the Dataflow job), you will simply trigger an export job that will be performed and managed by BigQuery.
This means that you do not need to worry about cluster resource consumption in Composer: the bigquery_to_gcs operator is simply the controller instructing BigQuery to do an export.
So, from the options you mention: bigquery_to_gcs operator, BashOperator, and Python function will incur a similar low cost. Just use whichever you find easier to manage.
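For reference, a minimal sketch of the bigquery_to_gcs option (Airflow 1.10-era contrib operator; project, dataset, and bucket names are placeholders). The task only submits and polls the extract job, so the heavy lifting stays in BigQuery:

# Trigger a BigQuery extract job to GCS from Composer; the export itself runs inside BigQuery.
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.bigquery_to_gcs import BigQueryToCloudStorageOperator

with DAG(
    dag_id="export_table_to_gcs",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    export = BigQueryToCloudStorageOperator(
        task_id="export_sales",
        source_project_dataset_table="my_project.reporting.sales_by_region",  # placeholder
        # The wildcard lets BigQuery shard tables larger than 1 GB into multiple files.
        destination_cloud_storage_uris=["gs://my-export-bucket/sales/part-*.csv"],
        export_format="CSV",
        compression="GZIP",
    )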
I have read many articles and solutions about scheduling exports of query results from Google BigQuery to external storage, but they didn't seem very clear.
Note: my company has a subscription only to Google BigQuery, not to the complete set of cloud services (Google Cloud Platform).
I know how to do it manually but I am looking to automate the process since I need the same data every week.
Any suggestions will be appreciated. Thank you.
Option 1
You can use Apache Airflow, which provides the option to create scheduled tasks on top of BigQuery using the BigQuery operator.
You can find in this link the basic steps required to start setting this up.
Option 2
You can use the Google BigQuery command-line tool to export your data as you would from the web UI, for example:
bq --location=[LOCATION] extract --destination_format [FORMAT] --compression [COMPRESSION_TYPE] --field_delimiter [DELIMITER] --print_header [BOOLEAN] [PROJECT_ID]:[DATASET].[TABLE] gs://[BUCKET]/[FILENAME]
Once you get this working, you can use any scheduling mechanism of your liking to schedule the run of this job.
BTW: Airflow has an operator (BashOperator) which lets you run the command-line tool.
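For example, a rough sketch of a weekly BashOperator task wrapping the bq extract command above (table, bucket, and location values are placeholders):

# Run the bq extract command on a weekly schedule from Airflow.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

BQ_EXTRACT = (
    "bq --location=US extract "
    "--destination_format CSV --compression GZIP "
    "--field_delimiter ',' --print_header true "
    "my_project:my_dataset.my_table "
    "gs://my-bucket/exports/my_table_{{ ds }}.csv"  # {{ ds }} is the run's execution date
)

with DAG(
    dag_id="weekly_bq_extract",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    BashOperator(task_id="bq_extract", bash_command=BQ_EXTRACT)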
Once the file is in GCS, you can use the Box G Suite integration to see and manage your files.
I am setting up a relationship where two Google App Engine applications (A and B) need to share data. B needs to read data from A, but A is not directly accessible to B. Both A and B currently use Google Datastore (NOT persistent disk).
I have an idea where I take a snapshot of A's state and upload it to a separate Google Cloud Storage location. This location can be read by B.
Is it possible to take a snapshot of A using Google App Engine and upload this snapshot (perhaps in JSON) to a separate Google Cloud Storage location to be read from by B? If so, how?
What you're looking for is the Datastore managed export/import service:
This page describes how to export and import Cloud Firestore in Datastore mode entities using the managed export and import service. The managed export and import service is available through the gcloud command-line tool and the Datastore mode Admin API (REST, RPC).
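As a concrete illustration (one of several ways to drive it), a short Python sketch of triggering such an export with the Datastore Admin client; the project ID and bucket are placeholders, and the equivalent CLI would be gcloud datastore export gs://a-snapshots --project=project-a:

# Trigger a managed export of app A's Datastore entities into a GCS bucket that app B can read.
from google.cloud import datastore_admin_v1  # pip install google-cloud-datastore

client = datastore_admin_v1.DatastoreAdminClient()

operation = client.export_entities(
    request={
        "project_id": "project-a",                # app A's project (placeholder)
        "output_url_prefix": "gs://a-snapshots",  # bucket readable by app B (placeholder)
    }
)

# Exports are long-running; block until the snapshot has been written.
result = operation.result()
print("Export metadata written under:", result.output_url)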
You can see a couple of examples described in a bit more detail in these more or less related posts:
Google AppEngine Getting 403 forbidden trying to update cron.yaml
Transferring data from product datastore to local development environment datastore in Google App Engine (Python)
You may need to take extra precautions:
if you need data consistency (exports are not atomic)
to handle potential conflicts in entity key IDs, especially if using manually-generated ones or referencing them in other entities
If A not being directly accessible to B isn't actually intentional, and you'd be OK with allowing B to access A, then that's also possible. The Datastore can be accessed from anywhere, even from outside Google Cloud (see How do I use Google datastore for my web app which is NOT hosted in google app engine?). It might be a bit tricky to set up, but once that's done it's IMHO a smoother sharing approach than export/import.
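If you go that route, here is a minimal sketch of B reading A's Datastore with a project-A service account (the key file path, project ID, and "Report" kind are all placeholders):

# App B reads app A's Datastore directly, authenticating as a service account from project A
# that has been granted Datastore read access (e.g. roles/datastore.viewer).
from google.cloud import datastore            # pip install google-cloud-datastore
from google.oauth2 import service_account     # pip install google-auth

creds = service_account.Credentials.from_service_account_file(
    "/secrets/project-a-reader.json"           # placeholder key file
)
client = datastore.Client(project="project-a", credentials=creds)

query = client.query(kind="Report")            # hypothetical kind
for entity in query.fetch(limit=10):
    print(entity.key, dict(entity))

If B itself runs on App Engine, granting B's default service account a Datastore role on project A should avoid having to manage a key file at all.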