Batch delete BigTable tables and BigQuery datasets - google-cloud-platform

I have searched around for a way to batch delete Bigtable tables and BigQuery datasets (using the Python client libraries) without any luck so far.
Is anyone aware of an efficient way to do that?
I looked into these links but found nothing promising:
BigQuery
BigTable
I'm looking for something similar to this example from the Datastore documentation:
from google.cloud import datastore
# For help authenticating your client, visit
# https://cloud.google.com/docs/authentication/getting-started
client = datastore.Client()
keys = [client.key("Task", 1), client.key("Task", 2)]
client.delete_multi(keys)
Batch delete

I don't think this is possible natively; you have to develop your own script.
For example, you can list all the tables and datasets to delete in a configuration, and then there are several options:
Develop a Python script that loops over the resources to delete and uses the Python BigQuery and Bigtable clients (see the sketch after this list): https://cloud.google.com/bigquery/docs/samples/bigquery-delete-dataset
https://cloud.google.com/bigtable/docs/samples/bigtable-hw-delete-table
Develop a shell script that loops over the tables to delete and uses bq and cbt (from the gcloud SDK):
https://cloud.google.com/bigquery/docs/managing-tables?hl=en#deleting_a_table
https://cloud.google.com/bigtable/docs/cbt-reference?hl=fr
If it's an option on your side, you can also use Terraform to manage and delete multiple BigQuery and Bigtable tables, but it's better suited to cases where you need to manage state for your infrastructure:
https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/bigquery_table
https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/bigtable_table
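For illustration, here is a minimal Python sketch of the first option; all dataset, table and instance names below are placeholders, and it assumes the google-cloud-bigquery and google-cloud-bigtable packages are installed:
# Hypothetical sketch: loop over configured resources and delete them
# with the official Python clients. Names below are placeholders.
from google.cloud import bigquery
from google.cloud import bigtable

DATASETS_TO_DELETE = ["dataset_a", "dataset_b"]        # placeholders
BIGTABLE_TABLES_TO_DELETE = ["table_a", "table_b"]     # placeholders

# BigQuery: delete each dataset (and its tables) in a loop.
bq_client = bigquery.Client()
for dataset_id in DATASETS_TO_DELETE:
    bq_client.delete_dataset(dataset_id, delete_contents=True, not_found_ok=True)
    print(f"Deleted BigQuery dataset {dataset_id}")

# Bigtable: delete each table of a given instance in a loop.
bt_client = bigtable.Client(admin=True)
instance = bt_client.instance("my-instance")           # placeholder instance ID
for table_id in BIGTABLE_TABLES_TO_DELETE:
    instance.table(table_id).delete()
    print(f"Deleted Bigtable table {table_id}")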

Related

How can I save SQL script from AWS Athena view with boto3/python

I have been working with AWS Athena for a while and need to create a backup and version control of the views. I'm trying to build an automation so the backup runs daily and captures all the views.
I tried to find a way to copy all the views created in Athena using boto3, but I couldn't find one. With DBeaver I can see and export a view's SQL script, but from what I've seen only one at a time, which does not serve the goal.
I'm open to any approach.
I tried to find an answer in the boto3 and DBeaver documentation; reading threads on Stack Overflow and some Google searching did not get me very far.
Views and Tables are stored in the AWS Glue Data Catalog.
You can query the AWS Glue Data Catalog from Amazon Athena to obtain information about tables, partitions, columns, etc.
However, if you want to obtain the DDL that was used to create the views, you will probably need to use SHOW CREATE TABLE [db_name.]table_name:
Analyzes an existing table named table_name to generate the query that created it.
Have you tried using get_query_results in boto3?
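Combining both suggestions, here is a hedged boto3 sketch; the database name, result bucket and polling logic are assumptions, not part of the original answers. It lists the views registered in the Glue Data Catalog, runs SHOW CREATE TABLE for each one through Athena, and reads the DDL back with get_query_results.
# Hypothetical sketch: back up Athena view DDL via Glue + Athena (boto3).
# Database name, output bucket and the simple polling loop are assumptions.
import time
import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

DATABASE = "my_database"                          # placeholder
OUTPUT = "s3://my-athena-results/ddl-backup/"     # placeholder query-result location

# 1. Find the views registered in the Glue Data Catalog.
tables = glue.get_tables(DatabaseName=DATABASE)["TableList"]
views = [t["Name"] for t in tables if t.get("TableType") == "VIRTUAL_VIEW"]

# 2. Ask Athena for the DDL of each view and collect the result rows.
for view in views:
    qid = athena.start_query_execution(
        QueryString=f"SHOW CREATE TABLE {DATABASE}.{view}",
        ResultConfiguration={"OutputLocation": OUTPUT},
    )["QueryExecutionId"]

    # Poll until the query finishes (simplified; add error handling as needed).
    state = "QUEUED"
    while state in ("QUEUED", "RUNNING"):
        time.sleep(1)
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]

    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    ddl = "\n".join(r["Data"][0].get("VarCharValue", "") for r in rows)
    print(f"-- {view}\n{ddl}\n")
The printed DDL can then be written to files and committed to version control as the daily backup.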

Speed up BigQuery query job to import from Cloud SQL

I am running a query to generate a new BigQuery table of size ~1 TB (a few billion rows), as part of migrating a Cloud SQL table to BigQuery using a federated query. I use the BigQuery Python client to submit the query job; in the query I select everything from the Cloud SQL database table using EXTERNAL_QUERY.
I find that the query can take 6+ hours (and fails with "Operation timed out after 6.0 hour")! Even if it didn't fail, I would like to speed it up as I may need to perform this migration again.
I see that the PostgreSQL egress is 20 MB/sec, consistent with a job that would take half a day. Would it help if I consider something more distributed with Dataflow? Or, simpler, extend my Python code using the BigQuery client to generate multiple queries, which can run asynchronously in BigQuery?
Or is it possible to still use that single query but increase the egress traffic (database configuration)?
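For reference, here is the kind of fan-out I had in mind (a rough sketch; the connection name, table names and id ranges below are placeholders, and I don't know yet whether this actually helps given the egress limit):
# Hypothetical sketch: split one big federated query into several range-based
# query jobs that BigQuery runs concurrently. Connection path, table names and
# the id ranges are assumptions.
from google.cloud import bigquery

client = bigquery.Client()
CONNECTION = "my-project.us.my-cloudsql-connection"   # placeholder connection ID
DESTINATION = "my-project.my_dataset.migrated_table"  # placeholder destination table

# Example id ranges to shard on; in practice derive them from MIN/MAX of the key.
RANGES = [(0, 1_000_000_000), (1_000_000_000, 2_000_000_000), (2_000_000_000, 3_000_000_000)]

jobs = []
for lo, hi in RANGES:
    sql = f"""
    SELECT * FROM EXTERNAL_QUERY(
      '{CONNECTION}',
      'SELECT * FROM my_table WHERE id >= {lo} AND id < {hi}'
    )"""
    job_config = bigquery.QueryJobConfig(
        destination=DESTINATION,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    jobs.append(client.query(sql, job_config=job_config))

# Each job runs server-side; wait for all of them to finish.
for job in jobs:
    job.result()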
I think it is more suitable to use a dump export.
Running a query over such a large table is inefficient.
I recommend exporting the Cloud SQL data to a CSV file.
BigQuery can import CSV files, so you can use this file to create your new BigQuery table.
I'm not sure how long this job will take, but at least it will not fail.
Refer here for more details about exporting Cloud SQL data to a CSV dump.
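For the BigQuery side of that approach, a minimal sketch with the Python client, assuming the CSV export already sits in a Cloud Storage bucket; the URIs, destination table and schema autodetection are assumptions:
# Hypothetical sketch: load a Cloud SQL CSV export from Cloud Storage into
# BigQuery. Bucket, file and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,          # or pass an explicit schema
    skip_leading_rows=0,      # adjust if your export includes a header row
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/cloudsql-export/my_table_*.csv",   # placeholder URIs
    "my-project.my_dataset.my_table",                  # placeholder destination
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
print(client.get_table("my-project.my_dataset.my_table").num_rows, "rows loaded")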

Export Data from BigQuery to Google Cloud SQL using Create Job From SQL tab in DataFlow

I am working on a project that crunches data and does a lot of processing, so I chose to work with BigQuery as it has good support for analytical queries. However, the final computed result is stored in a table that has to power my webpage (used as a transactional/OLTP store). My understanding is that BigQuery is not suitable for transactional queries. I was looking into other alternatives and realized I can use Dataflow to do the analytical processing and move the data to Cloud SQL (a relational DB fits my purpose).
However, it's not as straightforward as it seems: first I have to create a pipeline to move the data to a Cloud Storage bucket and then move it to Cloud SQL.
Is there a better way to manage this? Can I use "Create Job from SQL" in Dataflow to do it? I haven't found any examples that use "Create Job from SQL" to process and move data to Cloud SQL.
Consider a simple example on Robinhood:
Compute the user's returns by looking at their portfolio and show a graph of the returns for every month.
There are other options besides using a pipeline, but in all cases you cannot export table data to a local file, to Sheets, or to Drive. The only supported export location is Cloud Storage, as stated on the Exporting table data documentation page.
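As a concrete illustration of that constraint, here is a minimal sketch that exports the computed BigQuery table to Cloud Storage as CSV, from where Cloud SQL can import it (for example with gcloud sql import csv); all project, table and bucket names are placeholders:
# Hypothetical sketch: export a BigQuery table to Cloud Storage as CSV.
# Project, dataset, table and bucket names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

extract_job = client.extract_table(
    "my-project.my_dataset.results_table",      # placeholder source table
    "gs://my-bucket/exports/results-*.csv",     # placeholder destination URIs
    job_config=bigquery.ExtractJobConfig(destination_format="CSV"),
)
extract_job.result()  # wait for the export to complete

# The CSV files in the bucket can then be imported into Cloud SQL, for example:
#   gcloud sql import csv MY_INSTANCE gs://my-bucket/exports/results-000000000000.csv \
#       --database=my_db --table=results_table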

How can I see metadata and lineage of data stored in AWS Redshift?

I am using solutions like Cloudera Navigator, Atlas and WhereHows to get metadata and lineage for Hadoop, HDFS, Hive, Sqoop and MapReduce.
Now we have a data warehouse in AWS Redshift as well. Is there a way to extract metadata, lineage, or both out of Redshift?
So far I have not found anything on this.
Is there a way to integrate it with WhereHows as a crawled solution?
I found only one post that gives some information about how to get metadata from Redshift, assuming it is similar to PostgreSQL. I am sure someone has written an open-source solution to this problem.
Or is it just a matter of writing a single simple script to extract this information?
I am looking for an enterprise-level solution. I hope someone can point me in the right direction.
The AWS Glue Data Catalog is a fully managed metadata management service. It includes the AWS Glue crawler, which automatically crawls your source (in your case, Redshift) and creates a centralized metadata repository that can be accessed by other AWS services.
Refer:
https://docs.aws.amazon.com/glue/latest/dg/components-overview.html
https://aws.amazon.com/glue/
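As a rough sketch of that setup (all names, the IAM role and the JDBC path are placeholders, and it assumes a Glue connection to the Redshift cluster already exists):
# Hypothetical sketch: create and start a Glue crawler that catalogs Redshift
# tables through an existing JDBC connection. All names/ARNs are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="redshift-metadata-crawler",                       # placeholder
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder IAM role
    DatabaseName="redshift_catalog",                        # Glue database to populate
    Targets={
        "JdbcTargets": [
            {
                "ConnectionName": "my-redshift-connection",  # existing Glue connection
                "Path": "mydb/public/%",                     # database/schema/table pattern
            }
        ]
    },
)

glue.start_crawler(Name="redshift-metadata-crawler")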
You can access metadata by querying the system tables in Redshift:
https://docs.aws.amazon.com/redshift/latest/dg/cm_chap_system-tables.html
The system tables are on the leader node in each cluster (see this guide on the Redshift Architecture that I wrote).
Redshift deletes the content of the system tables on a rolling basis, so you need to store that data in your cluster, or another separate cluster, to get a history. With the data in the system tables, you have a baseline of information about your queries and what tables they are touching.
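As a hedged illustration of that baseline, here is a small query over the STL system tables, run here with psycopg2; the connection details and the seven-day window are placeholders:
# Hypothetical sketch: pull query-to-table usage from Redshift system tables.
# Connection parameters are placeholders; system-table retention is limited,
# so copy this data out regularly if you want a history.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="mydb", user="admin", password="...",
)

SQL = """
SELECT DISTINCT
       q.query,
       q.starttime,
       TRIM(q.querytxt) AS query_text,
       TRIM(t."table")  AS scanned_table
FROM stl_query q
JOIN stl_scan s       ON s.query = q.query
JOIN svv_table_info t ON t.table_id = s.tbl
WHERE q.starttime > DATEADD(day, -7, GETDATE())
ORDER BY q.starttime DESC;
"""

with conn, conn.cursor() as cur:
    cur.execute(SQL)
    for query_id, started, text, table in cur.fetchall():
        print(query_id, started, table)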
You can put a dashboard like Kibana or Periscope Data on top of that data to visualize it. Plaid has done a write-up of how they've built an in-house monitoring solution that has some information about data lineage:
https://blog.plaid.com/managing-your-amazon-redshift-performance-how-plaid-uses-periscope-data/
But to get true data lineage, you need to understand how queries relate to your workflows, e.g. to an Airflow DAG. To get that information, you need to "tag" your queries so you can trace them in the context of transformations / workflows, rather than looking at each individual query.
This is something we've built into our product - heads up that it's a commercial solution:
https://www.intermix.io/blog/announcing-query-insights/
Unlike the raw logs from the system tables, we give you the context of what apps / workflows are triggering queries, which users are running them, and what tables they are touching.
Lars

MySQL to Google Big Query

I have several Django (Python) based back-end web applications from which I would like to start piping data into Google BigQuery in an automated fashion. The relational database on the backend is MySQL; these applications are not public facing and not in Google App Engine.
We already have Google Apps for Business along with a Google Big Data project set up. With that said, I can manually dump tables to CSV and import them into BigQuery, but are there any best practices for automating this kind of data delivery into Google? I've pored over the documentation and don't really see any definitive writing on this matter.
Any advice would be appreciated.
Thanks for reading
Recently WePay started a series of articles on how they use BigQuery to run their analytics. Their second article highlights how they use Apache Airflow to move data from MySQL to BigQuery:
https://wecode.wepay.com/posts/airflow-wepay
As they mention "We have only a single config-driven ETL DAG file. It dynamically generates over 200 DAGs", and "The most important part is the select block. This defines which columns we pull from MySQL and load into BigQuery".
See the article for more details.
You can use Python scripts ("robots") that run on Linux with crontab.
For loading into Google Cloud Platform BigQuery, I use the pandas_gbq.to_gbq function:
Create your DataFrame (df) according to this or this
To get the token.json file:
Create a Google Cloud Platform BigQuery service account.
Load the JSON file:
import os

import pandas as pd
import pandas_gbq
from google.oauth2 import service_account

# Path to the service-account key file stored next to this script
DIR = os.path.dirname(os.path.realpath(__file__))
TOKEN_AUTH = os.path.join(DIR, 'token.json')
CREDENTIALS = service_account.Credentials.from_service_account_file(TOKEN_AUTH)

# df is a pandas DataFrame holding the rows to load
pandas_gbq.to_gbq(df, '<dataset>.<table_name>', project_id='<project_id>',
                  if_exists='append',  # or 'replace'
                  credentials=CREDENTIALS)
Once you have created your token, set up crontab on Linux and schedule your load-robot task:
Using crontab to execute script every minute and another every 24 hours
Finally, you can also use Apache Airflow (for advanced users with Docker skills).
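To make the Airflow option more concrete, here is a heavily hedged sketch in the spirit of the WePay approach above (config-driven DAG generation). It assumes a recent Airflow 2.x with the Google provider package installed; every connection ID, bucket and table name is a placeholder, and operator parameters may differ between provider versions.
# Hypothetical sketch: a config-driven Airflow DAG that exports MySQL tables to
# Cloud Storage and loads them into BigQuery. All names are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.transfers.mysql_to_gcs import MySQLToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

TABLES = ["customers", "orders"]      # the "config" that drives the DAG
BUCKET = "my-staging-bucket"          # placeholder

with DAG("mysql_to_bigquery", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False) as dag:
    for table in TABLES:
        extract = MySQLToGCSOperator(
            task_id=f"extract_{table}",
            mysql_conn_id="mysql_default",
            sql=f"SELECT * FROM {table}",            # the per-table "select block"
            bucket=BUCKET,
            filename=f"{table}/{{{{ ds }}}}/part-{{}}.json",
            export_format="json",
        )
        load = GCSToBigQueryOperator(
            task_id=f"load_{table}",
            bucket=BUCKET,
            source_objects=[f"{table}/{{{{ ds }}}}/part-*.json"],
            destination_project_dataset_table=f"my_project.my_dataset.{table}",
            source_format="NEWLINE_DELIMITED_JSON",
            write_disposition="WRITE_TRUNCATE",
            autodetect=True,
        )
        extract >> load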