MySQL to Google BigQuery - Django

I have several Django (Python) back-end web applications that I would like to start piping data into Google BigQuery in an automated fashion. The relational database on the back end is MySQL; these applications are not public facing and not in Google App Engine.
We already have Google Apps for Business along with a Google Cloud project with BigQuery set up. With that said, I can manually dump tables to CSV and import them into BigQuery, but are there any best practices for automating this kind of data delivery into Google? I've pored over the documentation and don't really see any definitive writing on this matter.
Any advice would be appreciated.
Thanks for reading

Recently WePay started a series of articles on how they use BigQuery to run their analytics. Their second article highlights how they use Apache Airflow to move data from MySQL to BigQuery:
https://wecode.wepay.com/posts/airflow-wepay
As they mention, "We have only a single config-driven ETL DAG file. It dynamically generates over 200 DAGs", and "The most important part is the select block. This defines which columns we pull from MySQL and load into BigQuery".
See the article for more details.
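For illustration, a minimal Airflow DAG along the same lines might stage a MySQL table into Google Cloud Storage and then load it into BigQuery. This is only a sketch, not WePay's actual DAG: it assumes the apache-airflow-providers-google package, the connection IDs, bucket, table and SQL are hypothetical placeholders, and exact operator names and arguments vary between Airflow versions.

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.mysql_to_gcs import MySQLToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="mysql_to_bigquery_example",   # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Dump the MySQL table to newline-delimited JSON in GCS.
    extract = MySQLToGCSOperator(
        task_id="mysql_to_gcs",
        mysql_conn_id="my_mysql",         # hypothetical connection IDs
        gcp_conn_id="my_gcp",
        sql="SELECT * FROM my_table",
        bucket="my-staging-bucket",
        filename="exports/my_table.json",
        export_format="json",
    )

    # Load the staged file into a BigQuery table.
    load = GCSToBigQueryOperator(
        task_id="gcs_to_bq",
        gcp_conn_id="my_gcp",
        bucket="my-staging-bucket",
        source_objects=["exports/my_table.json"],
        destination_project_dataset_table="my_project.my_dataset.my_table",
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_TRUNCATE",
        autodetect=True,
    )

    extract >> load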

You can use Python scripts ("robots") that run on Linux and are scheduled with crontab.
For loading into Google Cloud Platform BigQuery, I use the pandas_gbq.to_gbq function:
Create your dataframe (df); a sketch of building it from MySQL follows the code below.
In order to get the token.json file:
Create a Google Cloud Platform BigQuery service account.
Load the JSON file:
import os

import pandas as pd
import pandas_gbq
from google.oauth2 import service_account

# Build the path to the service-account key file next to this script.
DIR = os.path.dirname(os.path.realpath(__file__))
TOKEN_AUTH = os.path.join(DIR, 'token.json')
CREDENTIALS = service_account.Credentials.from_service_account_file(TOKEN_AUTH)

# df is a pandas DataFrame
pandas_gbq.to_gbq(df, '<dataset>.<table_name>', project_id='<project_id>',
                  if_exists='replace',  # or 'append'
                  credentials=CREDENTIALS)
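As for building df in the first place, one common option (an assumption on my part, not the only way) is to read straight from MySQL with SQLAlchemy and pandas.read_sql; the connection string and table name below are placeholders.

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical MySQL connection string; adjust driver, user, password, host and database.
engine = create_engine('mysql+pymysql://user:password@localhost:3306/mydb')

# Pull the rows you want to ship to BigQuery into a DataFrame.
df = pd.read_sql('SELECT * FROM my_table', con=engine)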
Once you have created your token, install crontab on Linux and schedule your loading task:
Using crontab to execute a script every minute and another every 24 hours
Finally, you can also use Apache Airflow (for advanced users with Docker skills).

Related

Batch delete BigTable tables and BigQuery datasets

I have searched around for a way to batch delete Bigtable tables and BigQuery datasets (using Python's client libraries) without any luck so far.
Is anyone aware of an efficient way to do that?
I looked into these links, but nothing promising:
BigQuery
BigTable
I'm looking for something similar to this one, from the Datastore "Batch delete" documentation:
from google.cloud import datastore
# For help authenticating your client, visit
# https://cloud.google.com/docs/authentication/getting-started
client = datastore.Client()
keys = [client.key("Task", 1), client.key("Task", 2)]
client.delete_multi(keys)
I think it's not possible natively; you have to develop your own script.
For example, you can configure all the tables and datasets to delete, and then there are several options:
Develop a Python script that loops over them and uses the Python BigQuery and Bigtable clients (a sketch follows this list of options): https://cloud.google.com/bigquery/docs/samples/bigquery-delete-dataset
https://cloud.google.com/bigtable/docs/samples/bigtable-hw-delete-table
Develop a shell script that loops over them and uses bq and cbt (from the gcloud SDK):
https://cloud.google.com/bigquery/docs/managing-tables?hl=en#deleting_a_table
https://cloud.google.com/bigtable/docs/cbt-reference?hl=fr
If it's possible on your side, you can also use Terraform to delete multiple BigQuery and Bigtable tables, but that is a better fit if you need to manage state for your infrastructure:
https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/bigquery_table
https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/bigtable_table
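For the first option, a minimal sketch could look like the following. It assumes the google-cloud-bigquery and google-cloud-bigtable packages; the project, instance, dataset and table names are hypothetical placeholders.

from google.cloud import bigquery
from google.cloud import bigtable

PROJECT_ID = "my-project"             # hypothetical project
BIGTABLE_INSTANCE_ID = "my-instance"  # hypothetical Bigtable instance

# Datasets and tables configured for deletion.
DATASETS_TO_DELETE = ["dataset_a", "dataset_b"]
BIGTABLE_TABLES_TO_DELETE = ["table_a", "table_b"]

# Delete BigQuery datasets, including any tables they contain.
bq_client = bigquery.Client(project=PROJECT_ID)
for dataset_id in DATASETS_TO_DELETE:
    bq_client.delete_dataset(dataset_id, delete_contents=True, not_found_ok=True)
    print(f"Deleted BigQuery dataset {dataset_id}")

# Delete Bigtable tables (requires an admin client).
bt_client = bigtable.Client(project=PROJECT_ID, admin=True)
instance = bt_client.instance(BIGTABLE_INSTANCE_ID)
for table_id in BIGTABLE_TABLES_TO_DELETE:
    instance.table(table_id).delete()
    print(f"Deleted Bigtable table {table_id}")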

How do I import data from CSV to GCP Datastore?

I have been trying to import a sample NoSQL database into GCP Datastore. When the data is stored in GCS, Datastore asks for a file with a specific extension, i.e.
.overall_export_metadata.
I don't believe there are any existing tools that can just import a CSV into Datastore. You could write a Google Dataflow job to do this (a sketch follows below).
https://beam.apache.org/documentation/programming-guide/
https://cloud.google.com/dataflow/docs/quickstarts
It does look like Google provides a template-based Dataflow job that takes in a JSON file and writes it to Datastore:
https://cloud.google.com/dataflow/docs/guides/templates/provided-batch#gcstexttodatastore
JSON format:
https://cloud.google.com/datastore/docs/reference/data/rest/v1/Entity
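To give a flavour of the custom-pipeline route, here is a minimal Apache Beam sketch that parses CSV lines into Datastore entities. It is only an illustration: it assumes the apache-beam[gcp] package and a two-column CSV, and the project, bucket, kind and column names are hypothetical.

import csv

import apache_beam as beam
from apache_beam.io.gcp.datastore.v1new.datastoreio import WriteToDatastore
from apache_beam.io.gcp.datastore.v1new.types import Entity, Key

PROJECT = "my-project"              # hypothetical project
KIND = "Task"                       # hypothetical Datastore kind
INPUT = "gs://my-bucket/tasks.csv"  # hypothetical CSV with columns: name,done

def csv_line_to_entity(line):
    # Turn one CSV row into a Datastore entity keyed by the task name.
    name, done = next(csv.reader([line]))
    entity = Entity(Key([KIND, name], project=PROJECT))
    entity.set_properties({"name": name, "done": done == "true"})
    return entity

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ReadCSV" >> beam.io.ReadFromText(INPUT, skip_header_lines=1)
        | "ToEntity" >> beam.Map(csv_line_to_entity)
        | "WriteToDatastore" >> WriteToDatastore(PROJECT)
    )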

Speed up BigQuery query job to import from Cloud SQL

I am performing a query to generate a new BigQuery table of size ~1 TB (a few billion rows), as part of migrating a Cloud SQL table to BigQuery using a federated query. I use the BigQuery Python client to submit the query job; in the query I select everything from the Cloud SQL database table using EXTERNAL_QUERY.
I find that the query can take 6+ hours (and fails with "Operation timed out after 6.0 hour")! Even if it didn't fail, I would like to speed it up, as I may need to perform this migration again.
I see that the PostgreSQL egress is 20 Mb/sec, consistent with a job that would take half a day. Would it help to use something more distributed, such as Dataflow? Or, more simply, to extend my Python code using the BigQuery client to generate multiple queries that BigQuery can run asynchronously?
Or is it possible to keep the single query but increase the egress traffic (database configuration)?
I think it is more suitable to use a dump export.
Running a query over such a large table is inefficient.
I recommend exporting the Cloud SQL data to a CSV file.
BigQuery can import CSV files, so you can use this file to create your new BigQuery table.
I'm not sure how long this job will take, but at least it will not fail.
Refer here for more details about exporting Cloud SQL data to a CSV dump.
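As a sketch of that route (my own illustration, not part of the original answer): after exporting the table to a CSV file in a GCS bucket, for example with gcloud sql export csv, the file can be loaded with the BigQuery client. The bucket, table and project names below are placeholders.

from google.cloud import bigquery

# The CSV produced by the Cloud SQL export, e.g. via:
#   gcloud sql export csv INSTANCE gs://my-bucket/my_table.csv \
#       --database=mydb --query="SELECT * FROM my_table"
GCS_URI = "gs://my-bucket/my_table.csv"
TABLE_ID = "my-project.my_dataset.my_table"

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,  # or supply an explicit schema
    write_disposition="WRITE_TRUNCATE",
)

# Load the staged CSV directly from GCS into BigQuery.
load_job = client.load_table_from_uri(GCS_URI, TABLE_ID, job_config=job_config)
load_job.result()  # wait for the load to finish
print(f"Loaded {client.get_table(TABLE_ID).num_rows} rows into {TABLE_ID}")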

Import database problem in Google Cloud SQL

I have already hosted my WordPress website on Google Cloud Platform using Google Compute Engine. Now I want to split out my existing WordPress database and move it to Google Cloud SQL to improve my website's performance.
I successfully created an SQL instance on Google Cloud SQL. I followed this link, but I got an error when uploading my WordPress database backup.
After creating the database on Google Cloud SQL, when I click the import button it takes a few minutes and then shows: import failed: ERROR 1031 (HY000): Table storage engine for 'wp_wcfm_daily_analysis' doesn't have this option.
Thanks in advance.
In one of your import files there is a command that tries to change the storage engine from InnoDB to some other storage engine, probably MyISAM.
As stated in the Cloud SQL documentation:
InnoDB is the only supported storage engine for Second Generation instances because it is more resistant to table corruption than other MySQL storage engines, such as MyISAM.
You need to check whether the SQL file that you want to import has the option ENGINE = MyISAM attached to any CREATE TABLE command, and remove it; a small sketch for doing this to a dump file follows.
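For example (a sketch of my own, assuming the backup is a plain-text .sql dump; the file names are placeholders), you could rewrite any MyISAM engine clause to InnoDB before importing:

import re

# Hypothetical dump file names; adjust to your backup.
SRC = "wordpress_backup.sql"
DST = "wordpress_backup_innodb.sql"

with open(SRC, "r", encoding="utf-8") as src, open(DST, "w", encoding="utf-8") as dst:
    for line in src:
        # Replace any ENGINE=MyISAM option with InnoDB (case-insensitive,
        # tolerating spaces around '=').
        dst.write(re.sub(r"ENGINE\s*=\s*MyISAM", "ENGINE=InnoDB", line, flags=re.IGNORECASE))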
You can also try to convert all your tables to InnoDB by generating the statements with the following SQL code:
SET @DATABASE_NAME = 'name_of_your_db';
SELECT CONCAT('ALTER TABLE `', table_name, '` ENGINE=InnoDB;') AS sql_statements
FROM information_schema.tables AS tb
WHERE table_schema = @DATABASE_NAME
  AND `ENGINE` = 'MyISAM'
  AND `TABLE_TYPE` = 'BASE TABLE'
ORDER BY table_name DESC;
Then run the generated ALTER TABLE statements against your database. You can find here a related discussion.

How to deploy data from a Django web app to a cloud database that can be accessed by a Jupyter notebook, such as Kaggle?

I have built a Django web app. It has an SQL database. I would like to analyze this data and share the analysis on an online Jupyter notebook platform such as Kaggle.
I have already deployed the app to Google App Engine with the database as a SQL instance, but I don't know how to view this SQL instance's tables in Kaggle. There is an option to view BigQuery databases in Kaggle, but I don't know how to get the data from my SQL instance into BigQuery.
To be able to access the data with Kaggle, you would need to import the data from the Cloud SQL instance into BigQuery.
Currently there are several options for importing data into BigQuery; the best choice depends on what type of analysis you want to do with it.
If you just want to move the data from the Cloud SQL instance into BigQuery, the easiest way is to first export the data in CSV format and then import the CSV file into BigQuery.
In case you are working with a large database, you can also do it programmatically by using the Client Libraries, as sketched below.
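For the programmatic route, one option (an illustration only; the connection string, table and query names are placeholders, and it assumes the google-cloud-bigquery, pandas, SQLAlchemy and pyarrow packages) is to pull the rows into a DataFrame and push them to BigQuery with the client library:

import pandas as pd
import sqlalchemy
from google.cloud import bigquery

# Hypothetical Cloud SQL connection string and destination table.
engine = sqlalchemy.create_engine("mysql+pymysql://user:password@host:3306/mydb")
TABLE_ID = "my-project.my_dataset.my_table"

# Read the Django app's table and upload it to BigQuery.
df = pd.read_sql("SELECT * FROM my_table", con=engine)

client = bigquery.Client()
job = client.load_table_from_dataframe(df, TABLE_ID)  # requires pyarrow
job.result()  # wait for the load job to complete
print(f"Loaded {len(df)} rows into {TABLE_ID}")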