GCP Cloud Function to write data to BigQuery runs with success but data doesn't appear in BigQuery table - google-cloud-platform

I am running the following Cloud Function. It runs with success and indicates data was loaded to the table. But when I query the BigQuery table, no data has been added. I am getting no errors and no indication that it isn't working.
from google.cloud import bigquery
import pandas as pd

def download_data(event, context):
    df = pd.read_csv('https://covid.ourworldindata.org/data/ecdc/full_data.csv')
    # Create an empty list
    Row_list = []
    # Iterate over each row
    for index, rows in df.iterrows():
        # Create list for the current row
        my_list = [rows.date, rows.location, rows.new_cases, rows.new_deaths, rows.total_cases, rows.total_deaths]
        #print(my_list)
    # append the list to the final list
    Row_list.append(my_list)
    ## Get BigQuery set up
    client = bigquery.Client()
    table_id = "<project_name>.raw.daily_load"
    table = client.get_table(table_id)
    print(client)
    print(table_id)
    print(table)
    errors = client.insert_rows(table, Row_list)  # Make an API request.
    if errors == []:
        print("New rows have been added.")
Attempted so far:
- Check data was being pulled -> PASSED, I printed out Row_list and the data is there
- Run locally from my machine -> PASSED, data appeared when I ran it from a Python terminal
- Print out the table details -> PASSED, see attached screenshot, it all appears in the logs
- Confirm it is able to find the table -> PASSED, I changed the name of the table to one that didn't exist and it failed

Not sure what is next; any advice would be greatly appreciated.
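One more thing worth sanity-checking is what the loop actually accumulates before the insert call: a misplaced `append` can leave the list nearly empty, and inserting an almost-empty list still returns no errors, so the success message prints anyway. A plain-Python sketch (no BigQuery involved, sample rows made up for illustration):

```python
# Shows how the placement of append() changes what ends up in Row_list.
rows = [("2020-03-01", "Italy"), ("2020-03-02", "Italy"), ("2020-03-03", "Italy")]

# append inside the loop: every row is collected
inside = []
for r in rows:
    inside.append(list(r))

# append outside the loop: only the last row survives
outside = []
for r in rows:
    my_list = list(r)
outside.append(my_list)

print(len(inside), len(outside))  # 3 1
```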
Maybe this post in Google Cloud documentation could help.
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv#loading_csv_data_into_a_table

You can stream the data directly from the website to BigQuery using Cloud Functions, but the data must be clean and conform to BigQuery standards, otherwise the insertion will fail. One more point to note is that the dataframe columns must match the table columns for the data to be inserted successfully. I tested this out and saw insertion errors returned by the client when the column names didn't match.
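A quick pre-flight check along those lines can catch the mismatch before calling the API. This is a sketch with hard-coded names; in practice the table's column names would come from `client.get_table(table_id).schema`:

```python
def column_mismatches(df_columns, table_columns):
    """Return dataframe columns that have no matching table column."""
    return [c for c in df_columns if c not in set(table_columns)]

# Hypothetical names for illustration -- note the case difference on "Date".
table_columns = ["date", "location", "new_cases", "new_deaths"]
df_columns = ["Date", "location", "new_cases", "new_deaths"]

print(column_mismatches(df_columns, table_columns))  # ['Date'] -- case matters
```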
Writing the function
I have created a simple Cloud Function using the documentation and pandas example. The dependencies that need to be included are google-cloud-bigquery and pandas.
main.py
from google.cloud import bigquery
import pandas as pd

def hello_gcs(event, context):
    df = pd.read_csv('https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv')
    df.set_axis(["Month", "Year_1", "Year_2", "Year_3"], axis=1, inplace=True)  # => Rename the columns if necessary
    table_id = "project.dataset.airtravel"
    ## Get BigQuery set up
    client = bigquery.Client()
    table = client.get_table(table_id)
    errors = client.insert_rows_from_dataframe(table, df)  # Make an API request.
    # insert_rows_from_dataframe returns one list of errors per chunk
    if not any(errors):
        print("Data Loaded")
        return "Success"
    else:
        print(errors)
        return "Failed"
requirements.txt
# Function dependencies, for example:
# package>=version
google-cloud-bigquery
pandas
Now you can directly deploy the function.
Output: (screenshots of the function logs and the resulting table)

This assumes that the App Engine default service account has the default Editor role assigned and that you have a very simple schema for the BigQuery table. For example:

Field name     Type      Mode
date           STRING    NULLABLE
location       STRING    NULLABLE
new_cases      INTEGER   NULLABLE
new_deaths     INTEGER   NULLABLE
total_cases    INTEGER   NULLABLE
total_deaths   INTEGER   NULLABLE
The following modification of your code should work for an HTTP-triggered function. Notice that you were not including Row_list.append(my_list) inside the for loop to populate your list, and that, according to the samples in the documentation, you should be using a list of tuples:
from google.cloud import bigquery
import pandas as pd

client = bigquery.Client()
table_id = "[PROJECT-ID].[DATASET].[TABLE]"

def download_data(request):
    df = pd.read_csv('https://covid.ourworldindata.org/data/ecdc/full_data.csv')
    # Create an empty list
    Row_list = []
    # Iterate over each row
    for index, rows in df.iterrows():
        # Create a tuple for the current row
        my_list = (rows.date, rows.location, rows.new_cases, rows.new_deaths, rows.total_cases, rows.total_deaths)
        # Append the tuple to the final list
        Row_list.append(my_list)
    ## Get BigQuery set up
    table = client.get_table(table_id)
    errors = client.insert_rows(table, Row_list)  # Make an API request.
    if errors == []:
        print("New rows have been added.")
With the very simple requirements.txt file:
# Function dependencies, for example:
# package>=version
pandas
google-cloud-bigquery

Related

BigQuery - copy a query into a new table

I wrote a query for one of my BigQuery tables, called historical, and I would like to copy the result of this query into a new BigQuery table called historical_recent. I am having difficulty figuring out how to do this operation with Python. Right now, I am able to execute my query and get the expected result:
SELECT * FROM gcp-sandbox.dailydev.historical WHERE (date BETWEEN '2015-11-05 00:00:00' AND '2015-11-07 23:00:00')
I am also able to copy my BigQuery table without making any changes with this script:
from google.cloud import bigquery

client = bigquery.Client()
job = client.copy_table(
    'gcp-sandbox.dailydev.historical',
    'gcp-sandbox.dailydev.historical_copy')
How can I combine both using Python?
You can use an INSERT statement, as in the example below:
INSERT `gcp-sandbox.dailydev.historical_recent`
SELECT *
FROM `gcp-sandbox.dailydev.historical`
WHERE date BETWEEN '2015-11-05 00:00:00' AND '2015-11-07 23:00:00'
Using Python to save your query result:

from google.cloud import bigquery

client = bigquery.Client()

# Target table to save results
table_id = "gcp-sandbox.dailydev.historical_recent"

job_config = bigquery.QueryJobConfig(
    allow_large_results=True,
    destination=table_id,
    use_legacy_sql=False  # standard SQL, so the backticked table name resolves
)

sql = """
SELECT * FROM `gcp-sandbox.dailydev.historical`
WHERE date BETWEEN '2015-11-05 00:00:00' AND '2015-11-07 23:00:00'
"""

query = client.query(sql, job_config=job_config)
query.result()
print("Query results loaded to the table {}".format(table_id))
This example is based on the Google documentation.

AWS Glue dynamic frame - no column headers if no data

I read the Glue catalog table, convert it to a dataframe, and print the schema using the below (Spark with Python):

dyf = glueContext.create_dynamic_frame.from_catalog(database='database_name',
                                                    table_name='table_name',
                                                    redshift_tmp_dir=args['TempDir'])
df = dyf.toDF()
df.printSchema()
It works fine when the table has data.
But it doesn't print the schema if the table is empty (it is unable to get the schema of an empty table). As a result, the subsequent joins are failing.
Is there a way to make the dynamic frame get the table schema from the catalog even for an empty table, or any other alternative?
I found a solution. It is not ideal, but it works. If you call apply_mapping() on your DynamicFrame, it will preserve the schema in the DataFrame. For example, if your table has a column last_name, you can do:
dyf = glueContext.create_dynamic_frame.from_catalog(database='database_name',
                                                    table_name='table_name')
df = dyf.apply_mapping([
    ("last_name", "string", "last_name", "string")
]).toDF()
df.printSchema()

Optimal ETL process and platform

I am faced with the following problem, and I am a newbie to cloud computing and databases. I want to set up a simple dashboard for an application. Basically, I want to replicate this site, which shows data about air pollution: https://airtube.info/
As I see it, I need to do the following:
Download data from API: https://github.com/opendata-stuttgart/meta/wiki/EN-APIs and I have this link in mind in particular "https://data.sensor.community/static/v2/data.1h.json - average of all measurements per sensor of the last hour." (Technology: Python bot)
Set up a bot to transform the data a little bit to tailor them for our needs. (Technology: Python)
Upload the data to a database. (Technology: Google Big-Query or AWS)
Connect the database to a visualization tool so everyone can see it on our webpage. (Technology: Probably Dash in Python)
My questions are the following.
1. Do you agree with my thought process or you would change some element to make it more efficient?
2. What do you think about running a python script to transform the data? Is there any simpler idea?
3. Which technology would you suggest to set up the database?
Thank you for the comments!
Best regards,
Bartek
If you want to do some analysis on your data, I recommend uploading it to BigQuery; once that is done, you can create new queries and get the results you want to analyze. I was checking the dataset "data.1h.json", and I would create a table in BigQuery using a schema like this one:
CREATE TABLE dataset.pollution
(
  id NUMERIC,
  sampling_rate STRING,
  timestamp TIMESTAMP,
  location STRUCT<
    id NUMERIC,
    latitude FLOAT64,
    longitude FLOAT64,
    altitude FLOAT64,
    country STRING,
    exact_location INT64,
    indoor INT64
  >,
  sensor STRUCT<
    id NUMERIC,
    pin STRING,
    sensor_type STRUCT<
      id INT64,
      name STRING,
      manufacturer STRING
    >
  >,
  sensordatavalues ARRAY<STRUCT<
    id NUMERIC,
    value FLOAT64,
    value_type STRING
  >>
)
OK, we have created our table, so now we need to insert all the data from the JSON file into it. Since you want to use Python, I would use the BigQuery Python client library [1] to read the data from a bucket in Google Cloud Storage [2], where the file has to be stored, and transform it before uploading it to the BigQuery table.
The code would be something like this:
from google.cloud import storage
from google.cloud import bigquery
import json

client = bigquery.Client()
table_id = "project.dataset.pollution"

# Instantiate a Google Cloud Storage client and specify the required bucket and file
storage_client = storage.Client()
bucket = storage_client.get_bucket('bucket')
blob = bucket.blob('folder/data.1h.json')

table = client.get_table(table_id)

# Download the contents of the blob as a string and parse it with json.loads()
data = json.loads(blob.download_as_string(client=None))

# Partition the request in order to avoid reaching quotas
partition = len(data) // 4
cont = 0
data_aux = []
for part in data:
    if cont >= partition:
        errors = client.insert_rows(table, data_aux)  # Make an API request.
        if errors == []:
            print("New rows have been added.")
        else:
            print(errors)
        cont = 0
        data_aux = []
    # Avoid empty values (clean data)
    if part['location']['altitude'] == "":
        part['location']['altitude'] = 0
    if part['location']['latitude'] == "":
        part['location']['latitude'] = 0
    if part['location']['longitude'] == "":
        part['location']['longitude'] = 0
    data_aux.append(part)
    cont += 1

# Flush the final partial batch
if data_aux:
    errors = client.insert_rows(table, data_aux)
    if errors:
        print(errors)
As you can see above, I had to partition the rows in order to avoid exceeding the quota on the size of a single request; the relevant limits are listed here [3].
Also, some data in the location field has empty values, so it is necessary to handle them to avoid insertion errors.
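The partitioning logic above can be written more compactly as a generic chunking helper. This is a plain-Python sketch; the batch size of 500 is an arbitrary placeholder, not a BigQuery limit:

```python
def chunked(rows, size):
    """Yield successive batches of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

rows = list(range(1050))
batches = list(chunked(rows, 500))
print([len(b) for b in batches])  # [500, 500, 50]
```

Each batch would then be passed to client.insert_rows(table, batch), which also avoids the unflushed-final-batch pitfall of the manual counter approach.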
And since you already have your data stored in BigQuery, to create a new dashboard I would use the Data Studio tool [4] to visualize your BigQuery data and create queries over the columns you want to display.
[1] https://cloud.google.com/bigquery/docs/reference/libraries#using_the_client_library
[2] https://cloud.google.com/storage
[3] https://cloud.google.com/bigquery/quotas
[4] https://cloud.google.com/bigquery/docs/visualize-data-studio

How to create a BigQuery table with Airflow failure notification?

I have an Airflow DAG on GCP Composer that runs every 5 minutes. I would like to create a BigQuery table that will have the time when the DAG starts to run and a flag identifying whether it's a successful run or a failed run. For example, if the DAG runs at 2020-03-23 02:30 and the run fails, the BigQuery table will have a time column with 2020-03-23 02:30 and a flag column with 1. If it's a successful run, then the table will have a time column with 2020-03-23 02:30 and a flag column with 0. The table will append new rows.
Thanks in advance
You can use the list_dag_runs CLI command to list the DAG runs for a given dag_id. The information returned includes the state of each run.
Another option is retrieving the information via Python code in a few different ways. One such way that I've used in the past is the find method in airflow.models.dagrun.DagRun:
dag_id = 'my_dag'
dag_runs = DagRun.find(dag_id=dag_id)
for dag_run in dag_runs:
    print(dag_run.state)
Finally, use the BigQuery operator to write the DAG information into a BigQuery table. You can find an example of how to use the BigQueryOperator here.
Based on the solution by @Enrique, here is my final solution.
import pandas as pd
import pandas_gbq
from google.cloud import bigquery

def status_check(**kwargs):
    dag_id = 'dag_id'
    dag_runs = DagRun.find(dag_id=dag_id)
    arr = []
    arr1 = []
    for dag_run in dag_runs:
        arr.append(dag_run.state)
        arr1.append(dag_run.execution_date)
    data1 = {'dag_status': arr, 'time': arr1}
    df = pd.DataFrame(data1)
    project_name = "project_name"
    dataset = "Dataset"
    outputBQtableName = '{}.dag_status_tab'.format(dataset)
    df.to_gbq(outputBQtableName, project_id=project_name,
              if_exists='replace', progress_bar=False,
              table_schema=[
                  {'name': 'dag_status', 'type': 'STRING'},
                  {'name': 'time', 'type': 'TIMESTAMP'},
              ])
    return None

Dag_status = PythonOperator(
    task_id='Dag_status',
    python_callable=status_check,
)

DynamoDB BatchWriteItem: Provided list of item keys contains duplicates

I am trying to use the DynamoDB BatchWriteItem operation to insert multiple records into one table.
This table has one partition key and one sort key.
I am using AWS lambda and Go language.
I get the elements to be inserted into a slice.
I am following this procedure:
1. Create a PutRequest structure and add AttributeValues for the first record from the list.
2. Create a WriteRequest from this PutRequest.
3. Add this WriteRequest to an array of WriteRequests.
4. Create a BatchWriteItemInput, which consists of RequestItems, basically a map of table name to the array of WriteRequests.
After that I call BatchWriteItem, which results in an error:
Provided list of item keys contains duplicates.
Any pointers, why this could be happening?
You've provided two or more items with identical primary keys (which in your case means identical partition and sort keys).
Per the BatchWriteItem docs, you cannot perform multiple operations on the same item in the same BatchWriteItem request.
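A common workaround is to de-duplicate the batch before sending it, keeping the last occurrence of each (partition key, sort key) pair. A plain-Python sketch of that idea (the key names and items are placeholders for illustration):

```python
def dedupe_items(items, partition_key, sort_key):
    """Keep only the last item for each (partition, sort) key pair,
    preserving first-seen order (dicts keep insertion order)."""
    seen = {}
    for item in items:
        seen[(item[partition_key], item[sort_key])] = item
    return list(seen.values())

items = [
    {"user_id": "u1", "game_id": "g1", "score": 10},
    {"user_id": "u1", "game_id": "g2", "score": 20},
    {"user_id": "u1", "game_id": "g1", "score": 30},  # duplicate key pair
]
# Two items survive; the duplicate (u1, g1) keeps its last score, 30.
print(dedupe_items(items, "user_id", "game_id"))
```

The de-duplicated list can then be split across BatchWriteItem requests without tripping the validation error.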
Consideration: this answer works for Python.
As @Benoit has remarked, the boto3 documentation states that if you want to bypass the no-duplication limitation of a single batch write request (botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the BatchWriteItem operation: Provided list of item keys contains duplicates.), you can specify overwrite_by_pkeys=['partition_key', 'sort_key'] on the batch writer to "de-duplicate request items in buffer if match new request item on specified primary keys", according to the documentation and the source code. That is, if the partition/sort key combination already exists in the buffer, the buffered request is dropped and replaced with the new one.
Example
Suppose there is a pandas dataframe that you want to write to a DynamoDB table; the following function could be helpful:
import json
import datetime as dt
from typing import Optional

import boto3
import pandas as pd

def write_dynamoDB(df: 'pandas.core.frame.DataFrame', tbl: str,
                   partition_key: Optional[str] = None,
                   sort_key: Optional[str] = None):
    '''
    Function to write a pandas DataFrame to a DynamoDB table through
    the batch write operation. In case there are any float values it
    handles them by converting the data to a JSON format.

    Arguments:
    * df: pandas DataFrame to write to the DynamoDB table.
    * tbl: DynamoDB table name.
    * partition_key (Optional): DynamoDB table partition key.
    * sort_key (Optional): DynamoDB table sort key.
    '''
    # Initialize AWS resource
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table(tbl)

    # Check if overwrite keys were provided
    overwrite_keys = [partition_key, sort_key] if partition_key else None

    # Check if there are floats (convert them to Decimal instead)
    if any([True for v in df.dtypes.values if v == 'float64']):
        from decimal import Decimal
        # Save decimals with JSON
        df_json = json.loads(
            json.dumps(df.to_dict(orient='records'),
                       default=date_converter,
                       allow_nan=True),
            parse_float=Decimal
        )
        # Batch write
        with table.batch_writer(overwrite_by_pkeys=overwrite_keys) as batch:
            for element in df_json:
                batch.put_item(Item=element)
    else:  # If there are no floats in the data
        # Batch write
        with table.batch_writer(overwrite_by_pkeys=overwrite_keys) as batch:
            columns = df.columns
            for row in df.itertuples():
                batch.put_item(
                    Item={col: row[idx + 1] for idx, col in enumerate(columns)}
                )

def date_converter(obj):
    if isinstance(obj, dt.datetime):
        return obj.__str__()
    elif isinstance(obj, dt.date):
        return obj.isoformat()
Then call it with write_dynamoDB(dataframe, 'my_table', 'the_partition_key', 'the_sort_key').
Use batch_writer instead of batch_write_item:

import boto3

dynamodb = boto3.resource("dynamodb", region_name='eu-west-1')
my_table = dynamodb.Table('mirrorfm_yt_tracks')

with my_table.batch_writer(overwrite_by_pkeys=["user_id", "game_id"]) as batch:
    for item in items:
        batch.put_item(
            Item={
                'user_id': item['user_id'],
                'game_id': item['game_id'],
                'score': item['score']
            }
        )
If you don't have a sort key, overwrite_by_pkeys can be None
This is essentially the same answer as @MiguelTrejo's (thanks! +1), but simplified.