We are using GCP Composer (managed Airflow) as our orchestration tool and BigQuery as our database. I need to push data from one table into another (both tables are located in BigQuery), but the method should be an upsert, so I wrote a SQL script that uses MERGE to update or insert.
I have 2 questions:
The MERGE script is located in the GCP Composer bucket. How can I read the SQL script from the bucket?
After reading the SQL file, how can I run the query on BigQuery?
Thanks
You can use the script below to read a file in GCS. I tested it using a SQL script that does an INSERT and is saved in my Composer bucket.
read_gcs_op executes read_gcs_file() and returns the content of the SQL script. That content is then used by execute_query, which runs the query in the script. See the code below:
import datetime
import logging

from airflow import models
from airflow.operators import python
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook
from airflow.providers.google.cloud.hooks.gcs import GCSHook
from google.cloud import bigquery

YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)

BUCKET_NAME = 'your-composer-bucket'
GCS_FILES = ['sql_query.txt']
PREFIX = 'data'  # populate this if you stored your sql script in a directory in the bucket

default_args = {
    'owner': 'Composer Example',
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'start_date': YESTERDAY,
}

with models.DAG(
        'query_gcs_to_bq',
        catchup=False,
        default_args=default_args,
        schedule_interval=datetime.timedelta(days=1)) as dag:

    def read_gcs_file(**kwargs):
        hook = GCSHook()
        for gcs_file in GCS_FILES:
            # check if PREFIX is set and build the object path accordingly
            if PREFIX:
                object_name = f'{PREFIX}/{gcs_file}'
            else:
                object_name = f'{gcs_file}'

            # perform the GCS hook download
            resp_byte = hook.download_as_byte_array(
                bucket_name=BUCKET_NAME,
                object_name=object_name,
            )

        resp_string = resp_byte.decode("utf-8")
        logging.info(resp_string)
        return resp_string  # pushed to XCom as the task's return value

    read_gcs_op = python.PythonOperator(
        task_id='read_gcs',
        provide_context=True,
        python_callable=read_gcs_file,
    )

    sql_query = "{{ task_instance.xcom_pull(task_ids='read_gcs') }}"  # templated value returned by read_gcs_op

    def query_bq(sql):
        hook = BigQueryHook(bigquery_conn_id="bigquery_default", delegate_to=None, use_legacy_sql=False)
        client = bigquery.Client(project=hook._get_field("project"), credentials=hook._get_credentials())
        client.query(sql).result()  # wait for the job; for a SELECT, capture and return the result instead

    execute_query = python.PythonOperator(
        task_id='query_bq',
        provide_context=True,
        python_callable=query_bq,
        op_kwargs={
            "sql": sql_query,
        },
    )

    read_gcs_op >> execute_query
For testing I used an INSERT statement as the SQL script read by the DAG above:
sql_query.txt
INSERT `your-project.dataset.your_table` (name, age)
VALUES('Brady', 44)
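Since the original goal is an upsert, the same flow works with a MERGE script instead of the INSERT above; a minimal sketch, assuming hypothetical target/source table names and a shared id key:

```sql
MERGE `your-project.dataset.target_table` T
USING `your-project.dataset.source_table` S
ON T.id = S.id
WHEN MATCHED THEN
  UPDATE SET name = S.name, age = S.age
WHEN NOT MATCHED THEN
  INSERT (id, name, age) VALUES (S.id, S.name, S.age)
```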
Test done: after Composer finished executing read_gcs and query_bq, I checked my table and confirmed the INSERT statement succeeded.
Related
Scenario to be catered for:
User will share a sales.csv file in a Google Cloud Storage bucket.
The sales.csv data should be uploaded to Google BigQuery every time, with a timestamp.
Can someone guide me on how to do this with best practices?
For that you need to follow these steps:
Step 1: Create a Google Cloud Storage bucket
Step 2: Set up Google Cloud Functions
Step 3: Write the Cloud Function (Cloud Functions supports several runtimes; Python is used below)
import time

from google.cloud import bigquery

def load_sales_data(event, context):
    file = event
    timestamp = str(int(time.time() * 1000))
    table_name = f"sales_{timestamp}"
    bucket_name = file['bucket']  # your bucket name
    file_name = file['name']  # your file name

    bq_client = bigquery.Client()
    dataset = bq_client.dataset('my_dataset')  # your dataset name
    table = dataset.table(table_name)  # your table name

    schema = [
        bigquery.SchemaField("id", "INTEGER"),
        bigquery.SchemaField("date", "DATE"),
        bigquery.SchemaField("amount", "FLOAT"),
    ]
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        schema=schema,  # drop this and set autodetect=True if you prefer schema inference
    )

    uri = f"gs://{bucket_name}/{file_name}"
    load_job = bq_client.load_table_from_uri(uri, table, job_config=job_config)
    load_job.result()
    print(f"Data loaded into {table_name}")
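The function above can then be hooked to the bucket so it fires on every upload; a sketch of the deploy command, assuming a 1st-gen function whose source lives in main.py in the current directory and a placeholder bucket name:

```shell
# assumption: load_sales_data is defined in main.py; replace the bucket name
gcloud functions deploy load_sales_data \
    --runtime python310 \
    --trigger-resource your-sales-bucket \
    --trigger-event google.storage.object.finalize
```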
I have a BigQuery table. I would like to extract it into a pandas DataFrame inside a Cloud Function, then make some changes to the column headers and save it to Cloud Storage. Unfortunately my function is not working; can anyone see what the issue could be? Do I need to use a BigQuery extract job, or is my approach also valid?
import base64
import pandas as pd
from google.cloud import bigquery

def extract_partial_return(event, context):
    client = bigquery.Client()
    bucket_name = "abc_test"
    project = "bq_project"
    dataset_id = "bq_dataset"
    table_id = "Partial_Return_Table"
    sql = """
    SELECT * FROM `bq_project.bq_dataset.Partial_Return_Table`
    """
    # Running the query and putting the results directly into a df
    df = client.query(sql).to_dataframe()
    df.columns = ["ga:t_Id", "ga:product", "ga:quantity"]
    destination_uri = "gs://abc_test/Exports/Partial_Return_Table.csv"
    df.to_csv(destination_uri)
My requirements.txt looks like this:
# Function dependencies, for example:
# package>=version
google-cloud-bigquery
pandas
pyarrow
The pyarrow library is the key here.
import base64
import pandas as pd
from google.cloud import bigquery

def extract_partial_return(event, context):
    client = bigquery.Client()
    sql = """
    SELECT * FROM `bq_project.bq_dataset.Partial_Return_Table`
    """
    # Running the query and putting the results directly into a df
    df = client.query(sql).to_dataframe()
    df.columns = ["ga:t_Id", "ga:product", "ga:quantity"]
    destination_uri = "gs://abc_test/Exports/Partial_Return_Table.csv"
    df.to_csv(destination_uri)
requirements.txt:
pandas
fsspec
gcsfs
google-cloud-bigquery
google-cloud-storage
pyarrow
First of all, the idea is okay; however, you may want to use other products such as Dataflow or Dataproc, which are designed for these purposes.
On the other hand, to complete the idea you have now, you should take care of how you construct the SQL command, because you are not using the variables created for the project, dataset, etc. The same issue happens with the bucket. Moreover, I think you are lacking a couple of dependencies (fsspec and gcsfs).
Manuel
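If you would rather avoid the fsspec/gcsfs dependencies altogether, you can serialize the frame in memory and upload it with the plain storage client instead of writing to a gs:// path via to_csv. A minimal sketch, where the helper names, bucket, and blob names are placeholders of my own:

```python
import pandas as pd

def frame_to_csv_string(df, columns):
    # Rename the headers and serialize the frame to CSV in memory.
    df = df.copy()
    df.columns = columns
    return df.to_csv(index=False)

def upload_csv(df, columns, bucket_name, blob_name):
    # Upload the serialized CSV with the plain GCS client
    # (requires google-cloud-storage, but not fsspec/gcsfs).
    from google.cloud import storage
    csv_data = frame_to_csv_string(df, columns)
    storage.Client().bucket(bucket_name).blob(blob_name).upload_from_string(
        csv_data, content_type="text/csv"
    )
```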
I am trying to append data to a BQ table using Python code, which requires dynamic schema handling.
Can anyone provide me a link that handles the above scenario?
Example code for loading a .csv file into BigQuery using the Python client library:
# from google.cloud import bigquery
# client = bigquery.Client()
# filename = '/path/to/file.csv'
# dataset_id = 'my_dataset'
# table_id = 'my_table'

dataset_ref = client.dataset(dataset_id)
table_ref = dataset_ref.table(table_id)
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.skip_leading_rows = 1
job_config.autodetect = True

with open(filename, "rb") as source_file:
    job = client.load_table_from_file(source_file, table_ref, job_config=job_config)

job.result()  # Waits for table load to complete.

print("Loaded {} rows into {}:{}.".format(job.output_rows, dataset_id, table_id))
Also check this part of the documentation to know more about appending data into tables from a source file using the same or different schema.
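For the dynamic-schema part of the question, a load job can be allowed to evolve the destination table's schema on append; a sketch of the relevant job configuration (dataset, table, and file names are up to you):

```python
from google.cloud import bigquery

# Append to an existing table while tolerating schema changes in the source file.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    # append instead of overwriting the table
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    # let the load job add new columns / relax REQUIRED to NULLABLE
    schema_update_options=[
        bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
        bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION,
    ],
)
```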
I am loading a New Line Delimited JSON to BigQuery using the following code snippet in Python 2.7:
from google.cloud import bigquery
from apiclient.discovery import build
from oauth2client.service_account import ServiceAccountCredentials

bigquery_client = bigquery.Client()
dataset = bigquery_client.dataset('testGAData')
table_ref = dataset.table('gaData')
table = bigquery.Table(table_ref)

with open('gaData.json', 'rb') as source_file:
    job_config = bigquery.LoadJobConfig()
    job_config.source_format = 'NEWLINE_DELIMITED_JSON'
    job = bigquery_client.load_table_from_file(
        source_file, table, job_config=job_config)
It returns the following error:
File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/google/cloud/bigquery/client.py", line 897, in load_table_from_file
raise exceptions.from_http_response(exc.response)
google.api_core.exceptions.BadRequest: 400 POST https://www.googleapis.com/upload/bigquery/v2/projects/test-project-for-experiments/jobs?uploadType=resumable: Required parameter is missing
Why am I getting this error? How can I fix this? Has anyone else faced a similar issue? Thanks in advance.
Issues observed with the initial code:
You are missing the schema for your table. You can either use job_config.autodetect = True or job_config.schema = [bigquery.SchemaField("FIELD NAME", "FIELD TYPE")].
From the documentation, you should set job_config.source_format = `bigquery.SourceFormat.NEWLINE_DELIMITED_JSON` for a JSON file source.
You should pass your table_ref variable as an argument instead of your table variable in bigquery_client.load_table_from_file(source_file, table, job_config=job_config).
Link to the documentation
Working Code
The code below works for me. I am using Python 3 and google-cloud-bigquery v1.5.
from google.cloud import bigquery

client = bigquery.Client()
dataset_id, table_id = "TEST_DATASET", "TEST_TABLE"
data_ref = client.dataset(dataset_id)
table_ref = data_ref.table(table_id)
file_path = "path/to/test.json"

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
# job_config.autodetect = True
job_config.schema = [
    bigquery.SchemaField("Name", "STRING"),
    bigquery.SchemaField("Age", "INTEGER"),
]

with open(file_path, 'rb') as source_file:
    job = client.load_table_from_file(source_file, table_ref, location='US', job_config=job_config)

job.result()
print('Loaded {} rows into {}:{}.'.format(job.output_rows, dataset_id, table_id))
Output
>> Loaded 2 rows into TEST_DATASET:TEST_TABLE.
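For reference, a hypothetical newline-delimited test.json matching that schema would look like this (two records, matching the two loaded rows):

```json
{"Name": "Alice", "Age": 30}
{"Name": "Bob", "Age": 25}
```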
I am using the Python module PyAthenaJDBC in order to query Athena through the provided JDBC driver.
Here is the link: https://pypi.python.org/pypi/PyAthenaJDBC/
I have been facing a persistent issue: I keep getting a Java error whenever I use the Athena connection twice in a row.
As a matter of fact, I was able to connect to Athena, show databases, create new tables and even query their content. I am building an application using Django and running its server to use Athena.
However, I am obliged to re-run the server in order for the Athena connection to work once again.
Here is a glimpse of the class I have built:
import os
import configparser
import pyathenajdbc

# Get AWS credentials for the moment
aws_config_file = '~/.aws/config'
Config = configparser.ConfigParser()
Config.read(os.path.expanduser(aws_config_file))
access_key_id = Config['default']['aws_access_key_id']
secret_key_id = Config['default']['aws_secret_access_key']

BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
athena_jdbc_driver_path = BASE_DIR + "/lib/static/AthenaJDBC.jar"
log_path = BASE_DIR + "/lib/static/queries.log"

class PyAthenaLoader():
    def __init__(self):
        pyathenajdbc.ATHENA_JAR = athena_jdbc_driver_path

    def connecti(self):
        self.conn = pyathenajdbc.connect(
            s3_staging_dir="s3://aws-athena-query-results--us-west-2",
            access_key=access_key_id,
            secret_key=secret_key_id,
            # profile_name="default",
            # credential_file=aws_config_file,
            region_name="us-west-2",
            log_path=log_path,
            driver_path=athena_jdbc_driver_path
        )

    def databases(self):
        dbs = self.query("show databases;")
        return dbs

    def tables(self, database):
        tables = self.query("show tables in {0};".format(database))
        return tables

    def create(self):
        self.connecti()
        try:
            with self.conn.cursor() as cursor:
                cursor.execute(
                    """CREATE EXTERNAL TABLE IF NOT EXISTS sales4 (
                    Day_ID date,
                    Product_Id string,
                    Store_Id string,
                    Sales_Units int,
                    Sales_Cost float,
                    Currency string
                    )
                    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
                    WITH SERDEPROPERTIES (
                    'serialization.format' = '|',
                    'field.delim' = '|',
                    'collection.delim' = 'undefined',
                    'mapkey.delim' = 'undefined'
                    ) LOCATION 's3://athena-internship/';
                    """)
                res = cursor.description
        finally:
            self.conn.close()
        return res

    def query(self, req):
        self.connecti()
        try:
            with self.conn.cursor() as cursor:
                cursor.execute(req)
                print(cursor.description)
                res = cursor.fetchall()
        finally:
            self.conn.close()
        return res

    def info(self):
        res = []
        for i in dir(pyathenajdbc):
            temp = i + ' = ' + str(getattr(pyathenajdbc, i))
            # print(temp)
            res.append(temp)
        return res
Example of usage :
def test(request):
    athena = jdbc.PyAthenaLoader()
    res = athena.query('Select * from sales;')
    return render(request, 'test.html', {'data': res})
Works just fine!
However, refreshing the page would cause an error.
Note that I am using a local .jar file; I thought that would solve the issue, but I was wrong.
Even if I remove the path of the JDBC driver and let the module download it from S3, the error persists:
File "/home/tewfikghariani/.virtualenvs/venv/lib/python3.4/site-packages/pyathenajdbc/connection.py", line 69, in __init__
    ATHENA_CONNECTION_STRING.format(region=self.region_name, schema=schema_name), props)
jpype._jexception.java.sql.SQLExceptionPyRaisable:
java.sql.SQLException: No suitable driver found for
jdbc:awsathena://athena.us-west-2.amazonaws.com:443/hive/default/
Furthermore, when I run the module on its own, it works just fine.
When I set up multiple connections inside my view before rendering the template, that works just fine as well.
I guess the issue is related to the Django view: once one of the views has performed a connection with Athena, the next connection is no longer possible and the error is raised unless I restart the server.
Any help? If other details are missing I will provide them immediately.
Update:
After posting the issue on GitHub, the author solved this problem and released a new version that works perfectly.
It was a multi-threading problem with JPype.
Question answered!
ref: https://github.com/laughingman7743/PyAthenaJDBC/pull/8