How can I access the metadata DB of a GCP Composer Airflow server? - google-cloud-platform

I have created a Composer environment in my GCP project. I want to access the metadata DB of Airflow, which runs in the background on Cloud SQL.
How can I access it?
I also want to create a table inside that metadata DB, which one of my Airflow DAGs will use to store some query results. Is it OK to create a table inside the metadata DB, or is that DB reserved for the Airflow server's own use?

You can access the Airflow internal DB via the UI using Data Profiling -> Ad Hoc Query.
There you can see all the tables with a SQL query like:
SHOW tables;
I wouldn't recommend creating a new table or manually inserting rows into existing tables, though.
You should also be able to access this DB in your DAGs' operators and sensors by using the airflow_db connection.
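If you want to do that from a DAG, here is a minimal sketch assuming a MySQL-backed metadata database and Airflow 1.10-style imports (as in Composer 1); the DAG id and the query are illustrative only, not taken from the question:

from datetime import datetime

from airflow import DAG
from airflow.hooks.mysql_hook import MySqlHook
from airflow.operators.python_operator import PythonOperator


def read_dag_runs():
    # 'airflow_db' is the default connection id pointing at the metadata database.
    # Assumes a MySQL backend; on a PostgreSQL-backed environment use PostgresHook instead.
    hook = MySqlHook(mysql_conn_id="airflow_db")
    for row in hook.get_records("SELECT dag_id, state, execution_date FROM dag_run LIMIT 10"):
        print(row)


with DAG(
    dag_id="query_airflow_metadata",   # illustrative DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="read_dag_runs", python_callable=read_dag_runs)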

Related

How can I save SQL script from AWS Athena view with boto3/python

I have been working with AWS Athena for a while and need to create a backup and version control of the views. I'm trying to build an automation for the backup to run daily and collect all the views.
I tried to find a way to copy all the views created in Athena using boto3, but I couldn't find one. With DBeaver I can see and export a view's SQL script, but from what I've seen only one at a time, which doesn't serve the goal.
I'm open to any approach.
I tried to find an answer in the boto3 and DBeaver documentation, read threads on Stack Overflow and did some Google searching, but none of it got me very far.
Views and tables are stored in the AWS Glue Data Catalog.
You can query the AWS Glue Data Catalog from Amazon Athena to obtain information about tables, partitions, columns, etc.
However, if you want to obtain the DDL that was used to create the views, you will probably need to use SHOW CREATE TABLE [db_name.]table_name, which:
Analyzes an existing table named table_name to generate the query that created it.
Have you tried using get_query_results in boto3?
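To script the whole backup with boto3, here is a hedged sketch under a few assumptions: the views live in a single Glue database, you have an S3 location for Athena query results, and the database, bucket and region names below are placeholders. It lists the views through the Glue Data Catalog (views are stored there with TableType 'VIRTUAL_VIEW') and pulls each view's DDL with SHOW CREATE VIEW plus get_query_results:

import time

import boto3

DATABASE = "my_database"                     # placeholder: your Glue/Athena database
OUTPUT_LOCATION = "s3://my-athena-results/"  # placeholder: S3 location for query results
REGION = "us-east-1"                         # placeholder: your region

glue = boto3.client("glue", region_name=REGION)
athena = boto3.client("athena", region_name=REGION)

# Views are catalogued in Glue with TableType == 'VIRTUAL_VIEW'.
view_names = []
for page in glue.get_paginator("get_tables").paginate(DatabaseName=DATABASE):
    for table in page["TableList"]:
        if table.get("TableType") == "VIRTUAL_VIEW":
            view_names.append(table["Name"])


def fetch_view_ddl(view_name):
    """Run SHOW CREATE VIEW through Athena and return the DDL as text."""
    query_id = athena.start_query_execution(
        QueryString=f"SHOW CREATE VIEW {DATABASE}.{view_name}",
        ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
    )["QueryExecutionId"]
    while True:  # poll until the query finishes
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    return "\n".join(row["Data"][0].get("VarCharValue", "") for row in rows)


for name in view_names:
    with open(f"{name}.sql", "w") as handle:  # one local .sql file per view
        handle.write(fetch_view_ddl(name))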

Creating a BigQuery dataset from a log sink in GCP

When running
gcloud logging sinks list
it seems I have several sinks for my project
▶ gcloud logging sinks list
NAME DESTINATION FILTER
myapp1 bigquery.googleapis.com/projects/myproject/datasets/myapp1 resource.type="k8s_container" resource.labels.cluster_name="mygkecluster" resource.labels.container_name="myapp1"
myapp2 bigquery.googleapis.com/projects/myproject/datasets/myapp2 resource.type="k8s_container" resource.labels.cluster_name="mygkecluster" resource.labels.container_name="myapp2"
myapp3 bigquery.googleapis.com/projects/myproject/datasets/myapp3 resource.type="k8s_container" resource.labels.cluster_name="mygkecluster" resource.labels.container_name="myapp3"
However, when I navigate in my BigQuery console, I don't see the corresponding datasets.
Is there a way to import these sinks as datasets so that I can run queries against them?
This guide on creating BigQuery datasets does not list how to do so from a log sink (unless I am missing something)
Also any idea why the above datasets are not displayed when using the bq ls command?
Firstly, be sure you are in the correct project. If not, you can pin a dataset from an external project by clicking the PIN button (you need sufficient permissions for this).
Secondly, a Cloud Logging sink to BigQuery doesn't create the dataset, only the tables. So, if you created the sinks without the dataset, your sinks aren't running (or are running in error). Here are more details:
BigQuery: Select or create the particular dataset to receive the exported logs. You also have the option to use partitioned tables.
In general, what you expect this feature to do is right: using BigQuery as a log sink lets you query the logs with BQ. The problem you're facing, I believe, comes down to the web console vs. gcloud.
When using BigQuery as log sink, there are 2 ways to specify a dataset:
point to an existing dataset
create a new dataset
When creating a new sink via the web console, there's an option to have Cloud Logging create a new dataset for you as well. However, gcloud logging sinks create does not automatically create a dataset for you; it only creates the log sink. It also does not seem to validate whether the specified dataset exists.
To resolve this, you could either use the web console for the task or create the datasets on your own. There's nothing special about creating a BQ dataset to be a log sink destination compared to creating a BQ dataset for any other purpose. Create a BQ dataset, then create a log sink pointing to that dataset, and you're good to go.
Conceptually, the different products on GCP (BigQuery, Cloud Logging) run independently; a log sink in Cloud Logging is simply an object that pairs up a filter and a destination, but it does not own or manage the destination resource (e.g. a BQ dataset). The web console just provides some extra integration to make things easier.
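If you prefer to create the missing datasets in code rather than in the console, a minimal sketch with the google-cloud-bigquery client library could look like this; the project id, dataset ids and location are placeholders taken from the sink destinations above:

from google.cloud import bigquery

client = bigquery.Client(project="myproject")  # placeholder project id

# One dataset per sink destination listed by `gcloud logging sinks list`.
for dataset_id in ("myapp1", "myapp2", "myapp3"):
    dataset = bigquery.Dataset(f"myproject.{dataset_id}")
    dataset.location = "US"  # placeholder location
    client.create_dataset(dataset, exists_ok=True)
    print(f"Dataset {dataset_id} is ready to receive exported logs")

Keep in mind that the sink's writer identity (shown by gcloud logging sinks describe) still needs write access to the dataset before log entries start flowing into it.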

How to deploy data from Django WebApp to a cloud database which can be accessed by Jupyter notebook such as Kaggle?

I have built a Django web app. It has an SQL database. I would like to analyze this data and share the analysis using an online Jupyter notebook platform such as Kaggle.
I have already deployed the app to Google App Engine with a Cloud SQL instance, but I don't know how to view the SQL instance's tables in Kaggle. There is an option to view BigQuery datasets in Kaggle, but I don't know how to get the data from my SQL instance into BigQuery.
To be able to access the data with Kaggle, you would need to import the data from the Cloud SQL instance into BigQuery.
Currently there are several options for importing data into BigQuery; the best choice depends on what type of analysis you want to do with it.
If you just want to import the data from the Cloud SQL instance into BigQuery, the easiest way is to first export the data in CSV format and then import the CSV file into BigQuery.
In case you are working with a large database, you can also do it programmatically by using the client libraries.
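As a minimal sketch of the programmatic route, assuming you have already exported a table from Cloud SQL to a CSV file in Cloud Storage (the bucket, dataset and table names below are placeholders), the BigQuery Python client can load it like this:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholders: destination table and the CSV exported from Cloud SQL.
table_id = "myproject.analytics.django_data"
csv_uri = "gs://my-export-bucket/django_export.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the header row
    autodetect=True,      # let BigQuery infer the schema
)

load_job = client.load_table_from_uri(csv_uri, table_id, job_config=job_config)
load_job.result()  # wait for the load to complete

table = client.get_table(table_id)
print(f"Loaded {table.num_rows} rows into {table_id}")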

Migrate data to SQL DW for multiple tables

I'm currently using Azure Data Factory to move data from an Azure SQL database to an Azure SQL Data Warehouse instance.
This works fine for one table, but I have a lot of tables I'd like to move over. Using Azure Data Factory, it looks like I need to create a set of source/sink datasets and pipelines for every table in the database.
Is there a way to move multiple tables across without having to set up each table in the manner described above?
The copy operation allows you to select multiple tables to move in a single pipeline. From the Azure SQL Data Warehouse portal you can follow this process to set up a multi-table pipeline:
Click on the Load Data button
Select Azure Data Factory
Create a new data factory or use an existing one, making sure the Load Data option is selected
Select the Run once now option
Choose your Azure SQL Database source and enter the credentials
On the Select Tables screen, select multiple tables
Continue the Pipeline, save and execute

VORA Tables in Zeppelin and Spark shell

We have created a test table from the Spark shell as well as from Zeppelin. But when we run SHOW TABLES, only a single table is visible in the respective environment. A table created via the Spark shell is not displayed by the SHOW TABLES command in Zeppelin.
What is the difference between these two tables? Can anybody please explain?
The show tables command only shows the tables defined in the current session.
A table is created in the current session and also in a (persistent) catalog in Zookeeper. You can show all the tables that Vora saved in Zookeeper via this command:
SHOW DATASOURCETABLES
USING com.sap.spark.vora
OPTIONS(zkurls "<zookeeper_server>:2181")
You can also register all or single tables in the current session via this command:
REGISTER ALL TABLES
USING com.sap.spark.vora
OPTIONS(zkurls "<zookeeper_server>:2181")
REGISTER TABLE <tablename>
USING com.sap.spark.vora
OPTIONS(zkurls "<zookeeper_server>:2181")
So if you want to access from Zeppelin a table that you created in the Spark shell (and vice versa), you need to register it first.
You can use those commands if you need to clear the Zookeeper Catalog. Be aware that tables then need to be recreated:
import com.sap.spark.vora.client._
ClusterUtils.clearZooKeeperCatalog("<zookeeper_server>:2181")
This (and more) information can be found in the Vora Installation and Developer Guide