BigQuery table to df (DataFrame) in a Cloud Function - google-cloud-platform

I have a BigQuery table that I would like to extract into a pandas DataFrame inside a Cloud Function, then make some changes to the header row and save the result to Cloud Storage. Unfortunately my function is not working; can anyone see what the issue could be? Do I need to use a BigQuery extract job, or is my approach also valid?
import base64
import pandas as pd
from google.cloud import bigquery

def extract_partial_return(event, context):
    client = bigquery.Client()
    bucket_name = "abc_test"
    project = "bq_project"
    dataset_id = "bq_dataset"
    table_id = "Partial_Return_Table"
    sql = """
    SELECT * FROM `bq_project.bq_dataset.Partial_Return_Table`
    """
    # Running the query and putting the results directly into a df
    df = client.query(sql).to_dataframe()
    df.columns = ["ga:t_Id", "ga:product", "ga:quantity"]
    destination_uri = "gs://abc_test/Exports/Partial_Return_Table.csv"
    df.to_csv(destination_uri)
My requirements.txt looks like this:
# Function dependencies, for example:
# package>=version
google-cloud-bigquery
pandas
pyarrow

The pyarrow library is the key here.
import base64
import pandas as pd
from google.cloud import bigquery

def extract_partial_return(event, context):
    client = bigquery.Client()
    sql = """
    SELECT * FROM `bq_project.bq_dataset.Partial_Return_Table`
    """
    # Running the query and putting the results directly into a df
    df = client.query(sql).to_dataframe()
    df.columns = ["ga:t_Id", "ga:product", "ga:quantity"]
    destination_uri = "gs://abc_test/Exports/Partial_Return_Table.csv"
    df.to_csv(destination_uri)
requirements.txt:
pandas
fsspec
gcsfs
google-cloud-bigquery
google-cloud-storage
pyarrow

First of all, the idea of doing this is fine, although you may want to use other products such as Dataflow or Dataproc, which are designed for these purposes.
That said, to complete the approach you have now, you should take care of the way you are constructing the SQL command, because you are not using the variables you created for the project, dataset, and table. The same issue applies to the bucket. Moreover, I think you are missing a couple of dependencies (fsspec and gcsfs); a sketch of the parameterization follows below.
Manuel
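To illustrate Manuel's point about using the variables, here is a minimal sketch of the parameterized version, assuming the same placeholder names from the question (bucket_name, project, dataset_id, table_id) and fsspec/gcsfs added to requirements.txt; treat it as a sketch, not a definitive implementation:
import pandas as pd
from google.cloud import bigquery

def extract_partial_return(event, context):
    client = bigquery.Client()
    bucket_name = "abc_test"           # placeholder bucket
    project = "bq_project"             # placeholder project
    dataset_id = "bq_dataset"          # placeholder dataset
    table_id = "Partial_Return_Table"  # placeholder table
    # Build the query and the destination URI from the variables instead of hard-coding them
    sql = f"SELECT * FROM `{project}.{dataset_id}.{table_id}`"
    df = client.query(sql).to_dataframe()
    df.columns = ["ga:t_Id", "ga:product", "ga:quantity"]
    destination_uri = f"gs://{bucket_name}/Exports/{table_id}.csv"
    # Writing directly to a gs:// path is what requires fsspec and gcsfs
    df.to_csv(destination_uri, index=False)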

Related

Cannot fix start manual transfer runs method for bigquery_datatransfer_v1

I am trying to run a scheduled query only when a table updates in BigQuery. For this I am trying to make this Python code work in Cloud Functions, but it is giving me an error. I would highly appreciate any help.
I am running this python code :
import time
from google.protobuf.timestamp_pb2 import Timestamp
from google.cloud import bigquery_datatransfer_v1

def runQuery(parent, requested_run_time):
    client = bigquery_datatransfer_v1.DataTransferServiceClient()
    projectid = '629586xxxx'  # Enter your projectID here
    transferid = '60cc15f8-xxxx-xxxx-8ba2-xxxxx41bc'  # Enter your transferId here
    parent = client.transfer_config_path(projectid, transferid)
    start_time = Timestamp(seconds=int(time.time() + 10))
    response = client.start_manual_transfer_runs(parent, requested_run_time=start_time)
    print(response)
I get this error
start_manual_transfer_runs() got an unexpected keyword argument 'requested_run_time'
I did this using a Cloud Function and it worked for me.
import time
from google.protobuf.timestamp_pb2 import Timestamp
from google.cloud import bigquery_datatransfer_v1

def runQuery(parent, requested_run_time):
    client = bigquery_datatransfer_v1.DataTransferServiceClient()
    PROJECT_ID = 'graphical-reach-285218'
    TRANSFER_CONFIG_ID = '61adfc39-0000-206b-a7b0-089e08324288'
    parent = client.project_transfer_config_path(PROJECT_ID, TRANSFER_CONFIG_ID)
    start_time = bigquery_datatransfer_v1.types.Timestamp(seconds=int(time.time() + 10))
    response = client.start_manual_transfer_runs(parent, requested_run_time=start_time)
    print(parent)
    print(response)
Replace these with your own project ID and transfer config ID. The above code goes in main.py, and in requirements.txt please keep:
google-cloud-bigquery-datatransfer==1
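As a side note, the unexpected keyword argument error in the question is typical of newer releases of the client (2.x and later), where start_manual_transfer_runs takes a request object rather than keyword arguments. A rough sketch of that call style, assuming google-cloud-bigquery-datatransfer >= 2.0 (verify against the version you actually deploy):
import time
from google.protobuf.timestamp_pb2 import Timestamp
from google.cloud import bigquery_datatransfer_v1

def run_query_v2(project_id, transfer_config_id):
    client = bigquery_datatransfer_v1.DataTransferServiceClient()
    # Builds "projects/{project}/transferConfigs/{transfer_config}"
    parent = client.transfer_config_path(project_id, transfer_config_id)
    start_time = Timestamp(seconds=int(time.time() + 10))
    # In 2.x the RPC accepts a request object (or a dict with the same fields)
    request = bigquery_datatransfer_v1.StartManualTransferRunsRequest(
        parent=parent,
        requested_run_time=start_time,
    )
    return client.start_manual_transfer_runs(request=request)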

Dynamic Handling of BigQuery table schema while inserting data into BQ table from variable

I am trying to append data to a BigQuery table using Python code, which requires dynamic schema handling.
Can anyone provide a link that covers the above scenario?
Example code for loading a .csv file into BigQuery using the Python client library:
# from google.cloud import bigquery
# client = bigquery.Client()
# filename = '/path/to/file.csv'
# dataset_id = 'my_dataset'
# table_id = 'my_table'
dataset_ref = client.dataset(dataset_id)
table_ref = dataset_ref.table(table_id)
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.skip_leading_rows = 1
job_config.autodetect = True
with open(filename, "rb") as source_file:
    job = client.load_table_from_file(source_file, table_ref, job_config=job_config)
job.result()  # Waits for table load to complete.
print("Loaded {} rows into {}:{}.".format(job.output_rows, dataset_id, table_id))
Also check this part of the documentation to learn more about appending data to tables from a source file using the same or a different schema; a sketch of an append load follows below.
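Since the question is specifically about appending with a changing schema, here is a minimal sketch of an append load that allows new columns, assuming a CSV source and placeholder dataset/table names (a sketch under those assumptions, not a definitive implementation):
from google.cloud import bigquery

client = bigquery.Client()
table_ref = client.dataset("my_dataset").table("my_table")  # placeholder names

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.skip_leading_rows = 1
job_config.autodetect = True
# Append to the existing table instead of overwriting it
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
# Let the load job add columns that are not yet in the table's schema
job_config.schema_update_options = [bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION]

with open("/path/to/file.csv", "rb") as source_file:
    job = client.load_table_from_file(source_file, table_ref, job_config=job_config)
job.result()  # Waits for the load to complete.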

Required parameter is missing error while writing to BigQuery with google.cloud.bigquery in Python

I am loading a newline-delimited JSON file into BigQuery using the following code snippet in Python 2.7:
from google.cloud import bigquery
from apiclient.discovery import build
from oauth2client.service_account import ServiceAccountCredentials

bigquery_client = bigquery.Client()
dataset = bigquery_client.dataset('testGAData')
table_ref = dataset.table('gaData')
table = bigquery.Table(table_ref)
with open('gaData.json', 'rb') as source_file:
    job_config = bigquery.LoadJobConfig()
    job_config.source_format = 'NEWLINE_DELIMITED_JSON'
    job = bigquery_client.load_table_from_file(
        source_file, table, job_config=job_config)
It returns me the following error:
File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/google/cloud/bigquery/client.py", line 897, in load_table_from_file
raise exceptions.from_http_response(exc.response)
google.api_core.exceptions.BadRequest: 400 POST https://www.googleapis.com/upload/bigquery/v2/projects/test-project-for-experiments/jobs?uploadType=resumable: Required parameter is missing
Why am I getting this error? How can I fix this? Has anyone else faced a similar issue? Thanks in advance.
Edit: Added last para, included python imports and corrected the indents.
Issues observed with the initial code
You are missing the schema for your table. You can either use job_config.autodetect = True or job_config.schema = [bigquery.SchemaField("FIELD NAME", "FIELD TYPE")].
From the documentation, you should set job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON for a JSON file source.
You should pass your table_ref variable as an argument instead of your table variable in bigquery_client.load_table_from_file(source_file, table, job_config=job_config).
Link to the documentation
Working Code
The code below works for me. I am using Python 3 and google-cloud-bigquery v1.5.
from google.cloud import bigquery

client = bigquery.Client()
dataset_id, table_id = "TEST_DATASET", "TEST_TABLE"
data_ref = client.dataset(dataset_id)
table_ref = data_ref.table(table_id)
file_path = "path/to/test.json"

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
# job_config.autodetect = True
job_config.schema = [bigquery.SchemaField("Name", "STRING"), bigquery.SchemaField("Age", "INTEGER")]

with open(file_path, 'rb') as source_file:
    job = client.load_table_from_file(source_file, table_ref, location='US', job_config=job_config)
job.result()
print('Loaded {} rows into {}:{}.'.format(job.output_rows, dataset_id, table_id))
Output
>> Loaded 2 rows into TEST_DATASET:TEST_TABLE.

Python script to query from DynamoDB not giving all items

I have written the following Python code to fetch data from a table, but it is not fetching all the items I want. When I check the AWS console page for DynamoDB, I can see many more entries compared to what I get from the script.
from __future__ import print_function  # Python 2/3 compatibility
import boto3
import json
import decimal
from datetime import datetime
from boto3.dynamodb.conditions import Key, Attr
import sys

# Helper class to convert a DynamoDB item to JSON.
class DecimalEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, decimal.Decimal):
            if o % 1 > 0:
                return float(o)
            else:
                return int(o)
        return super(DecimalEncoder, self).default(o)

dynamodb = boto3.resource('dynamodb', aws_access_key_id='',
                          aws_secret_access_key='',
                          region_name='eu-west-1',
                          endpoint_url="http://dynamodb.eu-west-1.amazonaws.com")
mplaceId = int(sys.argv[1])
table = dynamodb.Table('XYZ')
response = table.query(
    KeyConditionExpression=Key('mplaceId').eq(mplaceId)
)
print('Number of entries found ', len(response['Items']))
I did the same thing from the AWS console too, querying by mplaceId.
Any reason why this is happening?
dynamodb.Table.query() returns at most 1 MB of data per call. From the boto3 documentation:
A single Query operation will read up to the maximum number of items set (if using the Limit parameter) or a maximum of 1 MB of data and then apply any filtering to the results using FilterExpression. If LastEvaluatedKey is present in the response, you will need to paginate the result set. For more information, see Paginating the Results in the Amazon DynamoDB Developer Guide .
That's actually not a boto3 limitation, but a limitation of the underlying Query API.
Instead of implementing pagination yourself, you can use boto3's built-in pagination. Here is an example showing the use of the query paginator provided by boto3:
import boto3
from boto3.dynamodb.conditions import Key

dynamodb_client = boto3.client('dynamodb')
paginator = dynamodb_client.get_paginator('query')
# The low-level client expects typed attribute values; mplaceId is numeric in the question
page_iterator = paginator.paginate(
    TableName='XYZ',
    KeyConditionExpression='mplaceId = :mplaceId',
    ExpressionAttributeValues={':mplaceId': {'N': str(mplaceId)}}
)
for page in page_iterator:
    print(page['Items'])

Retrieve S3 file as Object instead of downloading to absolute system path

I just started learning and using S3, and I have read the docs. I didn't find anything about fetching a file into an object instead of downloading it from S3. Is this possible, or am I missing something?
I want to avoid the additional IO of writing the file to disk after downloading it.
You might be looking for the get_object() method of the boto3 S3 client:
http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.get_object
This will get you a response dictionary whose Body member is a StreamingBody object, which you can use like a normal file and call the .read() method on. To get the entire content of the S3 object into memory you would do something like this:
import boto3

s3_client = boto3.client('s3')
s3_response_object = s3_client.get_object(Bucket=BUCKET_NAME_STRING, Key=FILE_NAME_STRING)
object_content = s3_response_object['Body'].read()
I prefer this approach, equivalent to a previous answer:
import boto3

s3 = boto3.resource('s3')

def read_s3_contents(bucket_name, key):
    response = s3.Object(bucket_name, key).get()
    return response['Body'].read()
But another approach could read the object into StringIO:
import StringIO  # Python 2; on Python 3 use io.BytesIO, since download_fileobj writes bytes
import boto3

s3 = boto3.resource('s3')

def read_s3_contents_with_download(bucket_name, key):
    string_io = StringIO.StringIO()
    s3.Object(bucket_name, key).download_fileobj(string_io)
    return string_io.getvalue()
You could use StringIO and get the file content from S3 using get_contents_as_string (this uses the legacy boto library), like this:
import pandas as pd
from io import StringIO
from boto.s3.connection import S3Connection

AWS_KEY = 'XXXXXXDDDDDD'
AWS_SECRET = 'pweqory83743rywiuedq'
aws_connection = S3Connection(AWS_KEY, AWS_SECRET)
bucket = aws_connection.get_bucket('YOUR_BUCKET')
fileName = "test.csv"
content = bucket.get_key(fileName).get_contents_as_string()
reader = pd.read_csv(StringIO(content))
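Since boto is the legacy SDK, a rough boto3 equivalent of the same idea, assuming the same placeholder bucket and file names (get_object returns bytes, hence BytesIO):
import io
import boto3
import pandas as pd

s3_client = boto3.client('s3')
s3_object = s3_client.get_object(Bucket='YOUR_BUCKET', Key='test.csv')
# Body is a StreamingBody; read() returns bytes, which pandas can consume via BytesIO
reader = pd.read_csv(io.BytesIO(s3_object['Body'].read()))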