BigQuery - copy a query into a new table - google-cloud-platform

I wrote a query against one of my BigQuery tables, called historical, and I would like to copy the result of this query into a new BigQuery table called historical_recent. I am having trouble figuring out how to do this operation with Python. Right now, I am able to execute my query and get the expected result:
SELECT * FROM gcp-sandbox.dailydev.historical WHERE (date BETWEEN '2015-11-05 00:00:00' AND '2015-11-07 23:00:00')
I am also able to copy my BigQuery table without making any changes with this script:
from google.cloud import bigquery

client = bigquery.Client()
job = client.copy_table(
    'gcp-sandbox.dailydev.historical',
    'gcp-sandbox.dailydev.historical_copy')
How can I combine both using Python?

You can use an INSERT statement, as in the example below:
INSERT `gcp-sandbox.dailydev.historical_recent`
SELECT *
FROM `gcp-sandbox.dailydev.historical`
WHERE date BETWEEN '2015-11-05 00:00:00' AND '2015-11-07 23:00:00'
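If you prefer to run this statement from Python, a minimal sketch (assuming the historical_recent table already exists with the same schema as historical) could look like this:
from google.cloud import bigquery

client = bigquery.Client()

# Run the INSERT ... SELECT as a standard SQL DML statement.
dml = """
INSERT `gcp-sandbox.dailydev.historical_recent`
SELECT *
FROM `gcp-sandbox.dailydev.historical`
WHERE date BETWEEN '2015-11-05 00:00:00' AND '2015-11-07 23:00:00'
"""
job = client.query(dml)
job.result()  # Wait for the DML job to finish.
print("Rows inserted:", job.num_dml_affected_rows)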

Using Python to save your query result to a destination table:
from google.cloud import bigquery

client = bigquery.Client()

# Target table to save results
table_id = "gcp-sandbox.dailydev.historical_recent"

job_config = bigquery.QueryJobConfig(
    allow_large_results=True,
    destination=table_id,
    use_legacy_sql=False
)

sql = """
SELECT * FROM `gcp-sandbox.dailydev.historical`
WHERE date BETWEEN '2015-11-05 00:00:00' AND '2015-11-07 23:00:00'
"""

query = client.query(sql, job_config=job_config)
query.result()  # Wait for the job to finish.
print("Query results loaded to the table {}".format(table_id))
This example is based on the Google documentation.
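If historical_recent already exists and should be overwritten on each run, the job config can also carry a write disposition (a small sketch, assuming overwriting is the desired behaviour):
job_config = bigquery.QueryJobConfig(
    destination=table_id,
    # Replace the destination table's contents on every run.
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)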

Related

PowerBI Query contains transformations that can't be used for DirectQuery

I am using PowerBI Desktop (2.96.1061.0) to connect to a local MS SQL server so I can prepare some visualizations. It is important to mention that all data connections (Tables, SQL queries) are using the DirectQuery option.
It's been quite a smooth experience so far. No issues at all. Now I am trying to get some new data, again, through a direct SQL query:
SELECT BillId, string_agg(PGroupName, ', ')
FROM
(SELECT bm.ImportedBillsId as BillId, pg.Name as PGroupName
FROM [BillMp] bm
JOIN [Mps] m on bm.ImportersId = m.Id
JOIN [PGroups] pg on m.PoliticalGroupId = pg.Id
GROUP BY bm.ImportedBillsId, pg.Name) t
GROUP BY BillId
but for some reason it is not letting me re-create the model and apply the new changes, even though the import wizard is able to preview the actual data prior to the update. This is the error that I am getting:
I have also tried to import only the data from the internal/nested query
SELECT bm.ImportedBillsId as BillId, pg.Name as PGroupName
FROM [BillMp] bm
JOIN [Mps] m on bm.ImportersId = m.Id
JOIN [PGroups] pg on m.PoliticalGroupId = pg.Id
GROUP BY bm.ImportedBillsId, pg.Name
and process (according to this article) the other/outer query through PowerBI but I am still getting the same error.

Django: How to select a different table based on input?

I have searched for a solution to this problem for a long time, but I haven't found an appropriate method.
Basically, all I have is tons of tables, and I want to query values from different tables using raw SQL.
In Django, we need a class representing a table to perform the query, for example:
Routes.objects.raw("SELECT * FROM routes")
In this way, I can only query a table, but what if I want to query different tables based on the user's input?
I'm new to Django; back in ASP.NET we could simply do the following query:
string query = "SELECT * FROM " + county + " ;";
var bus = _context.Database.SqlQuery<keelung>(query).ToList();
In this case, I can do the query directly against the database instead of through a model class, and I can select the table based on the user's selection.
Is there any method to achieve this with Django?
You can run raw queries in Django like this:
from django.db import connection

cursor = connection.cursor()
table = "my_table"  # table name chosen based on the user's input
cursor.execute("SELECT * FROM " + table)
data = cursor.fetchall()
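Note that concatenating user input into SQL like this is an injection risk, and table names cannot be passed as query parameters. A common workaround is to validate the requested name against a fixed set of known tables first; a sketch, where the allowlist and table names are hypothetical:
from django.db import connection

# Hypothetical allowlist of tables the user is allowed to query.
ALLOWED_TABLES = {"keelung", "taipei", "taichung"}

def fetch_rows(county):
    if county not in ALLOWED_TABLES:
        raise ValueError("Unknown table: {}".format(county))
    with connection.cursor() as cursor:
        # Safe to interpolate here because the name was validated above.
        cursor.execute("SELECT * FROM {}".format(county))
        return cursor.fetchall()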

AWS Glue dynamic frame - no column headers if no data

I read the Glue catalog table, convert it to a DataFrame, and print the schema using the code below (Spark with Python):
dyf = glueContext.create_dynamic_frame.from_catalog(database='database_name',
                                                    table_name='table_name',
                                                    redshift_tmp_dir=args['TempDir'])
df = dyf.toDF()
df.printSchema()
It works fine when the table has data.
But it doesn't print the schema if the table is empty (it is unable to get the schema of an empty table). As a result, the subsequent joins fail.
Is there a way to overcome this and make the dynamic frame get the table schema from the catalog even for an empty table, or is there any other alternative?
I found a solution. It is not ideal, but it works. If you call apply_mapping() on your DynamicFrame, it will preserve the schema in the DataFrame. For example, if your table has a column last_name, you can do:
dyf = glueContext.create_dynamic_frame.from_catalog(database='database_name',
                                                    table_name='table_name')
df = dyf.apply_mapping([
    ("last_name", "string", "last_name", "string")
]).toDF()
df.printSchema()
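Another workaround, if the expected columns are known up front, is to fall back to an explicitly defined empty Spark DataFrame whenever the table turns out to be empty, so downstream joins still see the right schema. A minimal sketch, assuming the usual spark session from the Glue job boilerplate and placeholder column names:
from pyspark.sql.types import StructType, StructField, StringType

# Placeholder schema; replace with the real catalog columns.
expected_schema = StructType([
    StructField("last_name", StringType(), True),
    StructField("first_name", StringType(), True),
])

df = dyf.toDF()
if len(df.columns) == 0:
    # An empty DynamicFrame converts to a DataFrame with no columns,
    # so rebuild an empty DataFrame that still carries the expected schema.
    df = spark.createDataFrame([], expected_schema)
df.printSchema()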

GCP Cloud Function to write data to BigQuery runs with success but data doesn't appear in BigQuery table

I am running the following Cloud Function. It runs with success and indicates data was loaded to the table. But when I query BigQuery, no data has been added. I am getting no errors and no indication that it isn't working.
from google.cloud import bigquery
import pandas as pd

def download_data(event, context):
    df = pd.read_csv('https://covid.ourworldindata.org/data/ecdc/full_data.csv')
    # Create an empty list
    Row_list = []
    # Iterate over each row
    for index, rows in df.iterrows():
        # Create list for the current row
        my_list = [rows.date, rows.location, rows.new_cases, rows.new_deaths, rows.total_cases, rows.total_deaths]
        #print(my_list)
    # append the list to the final list
    Row_list.append(my_list)
    ## Get BigQuery set up
    client = bigquery.Client()
    table_id = "<project_name>.raw.daily_load"
    table = client.get_table(table_id)
    print(client)
    print(table_id)
    print(table)
    errors = client.insert_rows(table, Row_list)  # Make an API request.
    if errors == []:
        print("New rows have been added.")
Attempted so far:
Check data was being pulled -> PASSED, I printed out Row_list and the data is there
Run locally from my machine -> PASSED, data appeared when I ran it from a python terminal
Print out the table details -> PASSED, see attached screenshot, it all appears in the logs
Confirm it is able to find the table -> PASSED, I changed the name of the table to one that didn't exist and it failed
Not sure what is next; any advice would be greatly appreciated.
Maybe this post in Google Cloud documentation could help.
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv#loading_csv_data_into_a_table
You can stream the data from the website directly to BigQuery using Cloud Functions, but the data should be clean and conform to BigQuery standards, otherwise the insertion will fail. One more point to note is that the dataframe columns must match the table columns for the data to be inserted successfully. I tested this out and saw insertion errors returned by the client when the column names didn't match.
Writing the function
I have created a simple Cloud Function using the documentation and pandas example. The dependencies that need to be included are google-cloud-bigquery and pandas.
main.py
from google.cloud import bigquery
import pandas as pd

def hello_gcs(event, context):
    df = pd.read_csv('https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv')
    df.set_axis(["Month", "Year_1", "Year_2", "Year_3"], axis=1, inplace=True)  # Rename the columns if necessary
    table_id = "project.dataset.airtravel"
    ## Get BigQuery set up
    client = bigquery.Client()
    table = client.get_table(table_id)
    errors = client.insert_rows_from_dataframe(table, df)  # Make an API request.
    # insert_rows_from_dataframe returns one list of errors per insert chunk,
    # so treat the load as successful only if every chunk is error-free.
    if all(not chunk for chunk in errors):
        print("Data Loaded")
        return "Success"
    else:
        print(errors)
        return "Failed"
requirements.txt
# Function dependencies, for example:
# package>=version
google-cloud-bigquery
pandas
Now you can directly deploy the function.
Output
Output Table
This assumes that the App Engine default service account has the default Editor role assigned and that you have a very simple schema for the BigQuery table. For example:
Field name     Type     Mode
date           STRING   NULLABLE
location       STRING   NULLABLE
new_cases      INTEGER  NULLABLE
new_deaths     INTEGER  NULLABLE
total_cases    INTEGER  NULLABLE
total_deaths   INTEGER  NULLABLE
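If the table does not exist yet, it could be created with a matching schema along these lines (the project, dataset, and table names are placeholders):
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("date", "STRING"),
    bigquery.SchemaField("location", "STRING"),
    bigquery.SchemaField("new_cases", "INTEGER"),
    bigquery.SchemaField("new_deaths", "INTEGER"),
    bigquery.SchemaField("total_cases", "INTEGER"),
    bigquery.SchemaField("total_deaths", "INTEGER"),
]

table = bigquery.Table("[PROJECT-ID].[DATASET].[TABLE]", schema=schema)
table = client.create_table(table)  # Make an API request.
print("Created table {}".format(table.full_table_id))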
The following modification of your code should work for an HTTP-triggered function. Notice that you were not including Row_list.append(my_list) inside the for loop to populate your list with the elements, and that, according to the samples in the documentation, you should be using a list of tuples:
from google.cloud import bigquery
import pandas as pd

client = bigquery.Client()
table_id = "[PROJECT-ID].[DATASET].[TABLE]"

def download_data(request):
    df = pd.read_csv('https://covid.ourworldindata.org/data/ecdc/full_data.csv')
    # Create an empty list
    Row_list = []
    # Iterate over each row
    for index, rows in df.iterrows():
        # Create a tuple for the current row
        my_list = (rows.date, rows.location, rows.new_cases, rows.new_deaths, rows.total_cases, rows.total_deaths)
        # Append the tuple to the final list
        Row_list.append(my_list)
    ## Get BigQuery set up
    table = client.get_table(table_id)
    errors = client.insert_rows(table, Row_list)  # Make an API request.
    if errors == []:
        print("New rows have been added.")
With the very simple requirements.txt file:
# Function dependencies, for example:
# package>=version
pandas
google-cloud-bigquery

GCP BigQuery: how to set an expiration date on a table via the Python API

I am using the BigQuery Python API to create a table, and I would like to set an expiration date on the table, so that the table is automatically dropped after a certain number of days.
Here is my code:
from datetime import datetime, timedelta
from google.cloud import bigquery as bq

client = bq.Client()
job_config = bq.QueryJobConfig()
dataset_id = dataset
table_ref = client.dataset(dataset_id).table(filename)
job_config.destination = table_ref
job_config.write_disposition = 'WRITE_TRUNCATE'

dt = datetime.now() + timedelta(seconds=259200)
unixtime = (dt - datetime(1970, 1, 1)).total_seconds()
expiration_time = unixtime
job_config.expires = expiration_time

query_job = client.query(query, job_config=job_config)
query_job.result()
The problem is that the expiration parameter doesn't seem to work. When I check the table details in the UI, the expiration date is still Never.
To answer a slightly different question: instead of specifying the expiration as part of the request options, you can use a CREATE TABLE statement, where the relevant option is expiration_timestamp. For example:
CREATE OR REPLACE TABLE my_dataset.MyTable
(
  x INT64,
  y FLOAT64
)
OPTIONS (
  expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 3 DAY)
);
This creates a table with two columns that will expire three days from now. CREATE TABLE supports an optional AS SELECT clause, too, if you want to create the table from the result of a query (the documentation goes into more detail).
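Applied from Python, a sketch of that approach (the dataset and table names below are placeholders) could look like this:
from google.cloud import bigquery

client = bigquery.Client()

# Create (or replace) the table from a query result, with a 3-day expiration.
ddl = """
CREATE OR REPLACE TABLE my_dataset.my_table
OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 3 DAY)
)
AS
SELECT *
FROM my_dataset.source_table
"""
client.query(ddl).result()  # Wait for the DDL statement to finish.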
To update an existing table's expiration time with Python:
import datetime
from google.cloud import bigquery
client = bigquery.Client()
table = client.get_table("project.dataset.table")
table.expires = datetime.datetime.now() + datetime.timedelta(days=1)
client.update_table(table, ['expires'])
Credits: /u/ApproximateIdentity
Looking at the docs for the query method, we can see that it's not possible to set an expiration time in the query job config.
The proper way of doing it is to set it on the Table resource, something like:
from datetime import datetime, timedelta
from google.cloud import bigquery as bq

client = bq.Client()
job_config = bq.QueryJobConfig()
dataset_id = dataset
table_ref = client.dataset(dataset_id).table(filename)
job_config.destination = table_ref
job_config.write_disposition = 'WRITE_TRUNCATE'

# Create the destination table first, with the expiration set on the Table resource.
table = bq.Table(table_ref)
dt = datetime.now() + timedelta(seconds=259200)
table.expires = dt
client.create_table(table)

query_job = client.query(query, job_config=job_config)
query_job.result()