Scenario that needs to be catered for:
A user will share a sales.csv file in a Google Cloud Storage bucket.
The data from sales.csv should be loaded into Google BigQuery every time, together with a timestamp.
Can someone guide me on how to do this following best practices?
For that you need to follow these steps:
Step 1: Create a Google Cloud Storage bucket
Step 2: Set up a Cloud Function triggered by uploads to that bucket
Step 3: Write the Cloud Function (you can write Cloud Functions in any of the supported runtimes; the example below uses Python)
import time

from google.cloud import bigquery


def load_sales_data(event, context):
    """Triggered when an object is finalized in the bucket; loads the CSV into a timestamped table."""
    file = event
    timestamp = str(int(time.time() * 1000))
    table_name = f"sales_{timestamp}"
    bucket_name = file['bucket']  # your bucket name
    file_name = file['name']      # your file name

    bq_client = bigquery.Client()
    dataset = bq_client.dataset('my_dataset')  # your dataset name
    table = dataset.table(table_name)          # your table name

    schema = [
        bigquery.SchemaField("id", "INTEGER"),
        bigquery.SchemaField("date", "DATE"),
        bigquery.SchemaField("amount", "FLOAT"),
    ]
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        schema=schema,  # autodetect is not needed when an explicit schema is provided
    )

    uri = f"gs://{bucket_name}/{file_name}"
    load_job = bq_client.load_table_from_uri(uri, table, job_config=job_config)
    load_job.result()  # wait for the load job to finish
    print(f"Data loaded into {table_name}")
I have recently completed this tutorial from AWS on how to create a thumbnail generator using Lambda and S3: https://docs.aws.amazon.com/lambda/latest/dg/with-s3-tutorial.html. Basically, I'm uploading an image file to my '-source' bucket, and then Lambda generates a thumbnail and uploads it to my '-thumbnail' bucket.
Everything works as expected. However, I wanted to use the S3 object URL in the '-thumbnail' bucket so that I can load the image from there for a small app I'm building. The issue I'm having is that the URL doesn't display the image in the browser but instead downloads the file. This causes my app to error out.
I did some research and learned that I had to change the content-type to image/jpeg and then also made the object public using ACL. This works for all of the other buckets I have except the one that has the thumbnail. I have recreated this bucket several times. I even copied the settings from my existing buckets. I have compared settings to all the other buckets and they appear to be the same.
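For reference, this is roughly what that fix looks like with boto3; the bucket and key below are placeholders for illustration only:
import boto3

s3 = boto3.client('s3')

# Placeholder bucket/key, just to illustrate the fix described above.
bucket, key = 'postreader-thumbnail', 'example.jpeg'

# Re-write the object in place with the correct Content-Type
# (MetadataDirective='REPLACE' is required for the new metadata to take effect),
# then make it publicly readable.
s3.copy_object(
    Bucket=bucket,
    Key=key,
    CopySource={'Bucket': bucket, 'Key': key},
    ContentType='image/jpeg',
    MetadataDirective='REPLACE',
)
s3.put_object_acl(ACL='public-read', Bucket=bucket, Key=key)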
I wanted to reach out and see if anyone has run into this type of issue before, or if there is something I might be missing.
Here is the code I'm using to generate the thumbnail.
import boto3
from boto3.dynamodb.conditions import Key, Attr
import os
import sys
import uuid
import urllib.parse
from urllib.parse import unquote_plus
from PIL.Image import core as _imaging
import PIL.Image
s3 = boto3.client('s3')
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table(os.environ['DB_TABLE_NAME'])
def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    recordId = key
    tmpkey = key.replace('/', '')
    download_path = '/tmp/{}{}'.format(uuid.uuid4(), tmpkey)
    upload_path = '/tmp/resized-{}'.format(tmpkey)
    try:
        s3.download_file(bucket, key, download_path)
        resize_image(download_path, upload_path)
        bucket = bucket.replace('source', 'thumbnail')
        s3.upload_file(upload_path, bucket, key)
        print(f"Thumbnail created and uploaded to {bucket} successfully.")
    except Exception as e:
        print(e)
        print('Error getting object {} from bucket {}. Make sure they exist and your bucket is in the same region as this function.'.format(key, bucket))
        raise e
    else:
        s3.put_object_acl(ACL='public-read',
                          Bucket=bucket,
                          Key=key)
        # create image url to add to dynamo
        url = f"https://postreader-thumbnail.s3.us-west-2.amazonaws.com/{key}"
        print(url)
        # create record id to update the appropriate record in the 'Posts' table
        recordId = key.replace('.jpeg', '')
        # add the image_url column along with the image url as the value
        table.update_item(
            Key={'id': recordId},
            UpdateExpression="SET #statusAtt = :statusValue, #img_urlAtt = :img_urlValue",
            ExpressionAttributeValues={':statusValue': 'UPDATED', ':img_urlValue': url},
            ExpressionAttributeNames={'#statusAtt': 'status', '#img_urlAtt': 'img_url'},
        )

def resize_image(image_path, resized_path):
    with PIL.Image.open(image_path) as image:
        # change to standard/hard-coded size
        image.thumbnail(tuple(x / 2 for x in image.size))
        image.save(resized_path)
This can happen if the Content-Type of the file you're uploading is binary/octet-stream. You can modify your script as below to provide a custom content type while uploading:
s3.upload_file(upload_path, bucket, key, ExtraArgs={'ContentType': "image/jpeg"})
After more troubleshooting, the issue was apparently related to the bucket's name. I created a new bucket with a different name than it had previously, and after doing so I was able to upload and share images without issue.
I edited my code so that the Lambda uploads to the new bucket, and I am now able to share the image via URL without it downloading.
We are using GCP Composer (managed Airflow) as an orchestration tool and BigQuery as the database. I need to push data into a table from another table (both tables are located in BigQuery), but the method should be upsert. So I wrote a SQL script that uses MERGE to update or insert.
I have 2 questions:
The merge script is located in the GCP Composer bucket; how can I read the SQL script from the bucket?
After reading the SQL file, how can I run the query on BigQuery?
Thanks
You can use the script below to read a file in GCS. I tested this using a SQL script that does an INSERT and is saved in my Composer bucket.
read_gcs_op will execute read_gcs_file() and return the content of the SQL script. That content is then used by execute_query to run the query on BigQuery. See the code below:
import datetime
import logging

from airflow import models
from airflow.operators import python
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook
from airflow.providers.google.cloud.hooks.gcs import GCSHook
from google.cloud import bigquery

YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)

BUCKET_NAME = 'your-composer-bucket'
GCS_FILES = ['sql_query.txt']
PREFIX = 'data'  # populate this if you stored your sql script in a directory in the bucket

default_args = {
    'owner': 'Composer Example',
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'start_date': YESTERDAY,
}

with models.DAG(
        'query_gcs_to_bq',
        catchup=False,
        default_args=default_args,
        schedule_interval=datetime.timedelta(days=1)) as dag:

    def read_gcs_file(**kwargs):
        hook = GCSHook()
        for gcs_file in GCS_FILES:
            # check if PREFIX is available and initialize the gcs file to be copied
            if PREFIX:
                object_name = f'{PREFIX}/{gcs_file}'
            else:
                object_name = f'{gcs_file}'
            # perform gcs hook download
            resp_byte = hook.download_as_byte_array(
                bucket_name=BUCKET_NAME,
                object_name=object_name,
            )
        resp_string = resp_byte.decode("utf-8")
        logging.info(resp_string)
        return resp_string

    read_gcs_op = python.PythonOperator(
        task_id='read_gcs',
        provide_context=True,
        python_callable=read_gcs_file,
    )

    sql_query = "{{ task_instance.xcom_pull(task_ids='read_gcs') }}"  # store returned value from read_gcs_op

    def query_bq(sql):
        hook = BigQueryHook(bigquery_conn_id="bigquery_default", delegate_to=None, use_legacy_sql=False)
        client = bigquery.Client(project=hook._get_field("project"), credentials=hook._get_credentials())
        client.query(sql)  # if you are not doing DML, you can assign this to a variable and return the value

    execute_query = python.PythonOperator(
        task_id='query_bq',
        provide_context=True,
        python_callable=query_bq,
        op_kwargs={
            "sql": sql_query
        },
    )

    read_gcs_op >> execute_query
For testing, I used an INSERT statement as the SQL script read by the DAG above:
sql_query.txt
INSERT `your-project.dataset.your_table` (name, age)
VALUES('Brady', 44)
Test done: the return value of task read_gcs is the content of the SQL script, and after Composer finished executing read_gcs and query_bq, I checked my table and confirmed the INSERT succeeded.
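Since your goal is an upsert, the same DAG works if the script file in the bucket contains a MERGE statement instead of an INSERT. A minimal sketch of such an upsert (the project, dataset, table, and column names below are placeholders), shown here run directly with the BigQuery Python client:
from google.cloud import bigquery

# Placeholder table and column names; adjust to your own schema.
merge_sql = """
MERGE `your-project.dataset.target_table` T
USING `your-project.dataset.source_table` S
ON T.id = S.id
WHEN MATCHED THEN
  UPDATE SET name = S.name, age = S.age
WHEN NOT MATCHED THEN
  INSERT (id, name, age) VALUES (S.id, S.name, S.age)
"""

client = bigquery.Client()
client.query(merge_sql).result()  # wait for the MERGE job to finish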
In Airflow, I have an XCom from the task customer_schema that I want to write out as a JSON file named final_schema.json and upload to Google Cloud Storage. My bucket in Google Cloud Storage is named northern_industrial_customer. I tried to use the following FileToGoogleCloudStorageOperator, but it did not work.
Does anyone know how I can transfer the customer_schema XCom to Google Cloud Storage as a JSON file named final_schema.json?
transfer_to_gcs = FileToGoogleCloudStorageOperator(
    task_id='transfer_to_gcs',
    src="{{task_instance.xcom_pull(task_ids='customer_schema')}}",
    dst='final_schema.json',
    bucket='northern_industrial_customer',
    google_cloud_storage_conn_id=conn_id_gcs)
There is no built-in operator in Airflow to perform this operation, but Airflow is extensible and you can write your own custom operator:
import tempfile

from airflow.gcp.hooks.gcs import GoogleCloudStorageHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class ContentToGoogleCloudStorageOperator(BaseOperator):
    """
    Uploads text content to Google Cloud Storage.
    Optionally can compress the content for upload.

    :param content: Content to upload. (templated)
    :type content: str
    :param dst: Destination path within the specified bucket, it must be the full file path
        to the destination object on GCS, including the GCS object (ex. `path/to/file.txt`) (templated)
    :type dst: str
    :param bucket: The bucket to upload to. (templated)
    :type bucket: str
    :param gcp_conn_id: (Optional) The connection ID used to connect to Google Cloud Platform.
    :type gcp_conn_id: str
    :param mime_type: The mime-type string
    :type mime_type: str
    :param delegate_to: The account to impersonate, if any
    :type delegate_to: str
    :param gzip: Allows for the content to be compressed and uploaded as gzip
    :type gzip: bool
    """
    template_fields = ('content', 'dst', 'bucket')

    @apply_defaults
    def __init__(self,
                 content,
                 dst,
                 bucket,
                 gcp_conn_id='google_cloud_default',
                 mime_type='application/octet-stream',
                 delegate_to=None,
                 gzip=False,
                 *args,
                 **kwargs):
        super().__init__(*args, **kwargs)
        self.content = content
        self.dst = dst
        self.bucket = bucket
        self.gcp_conn_id = gcp_conn_id
        self.mime_type = mime_type
        self.delegate_to = delegate_to
        self.gzip = gzip

    def execute(self, context):
        """
        Uploads the content to Google Cloud Storage
        """
        hook = GoogleCloudStorageHook(
            google_cloud_storage_conn_id=self.gcp_conn_id,
            delegate_to=self.delegate_to
        )

        with tempfile.NamedTemporaryFile(prefix="gcs-local") as file:
            file.write(self.content.encode('utf-8'))  # NamedTemporaryFile is opened in binary mode
            file.flush()
            hook.upload(
                bucket_name=self.bucket,
                object_name=self.dst,
                mime_type=self.mime_type,
                filename=file.name,
                gzip=self.gzip,
            )
transfer_to_gcs = ContentToGoogleCloudStorageOperator(
    task_id='transfer_to_gcs',
    content="{{task_instance.xcom_pull(task_ids='customer_schema')}}",
    dst='final_schema.json',
    bucket='northern_industrial_customer',
    gcp_conn_id=conn_id_gcs)
Please note that in Airflow 2.0 the google_cloud_storage_conn_id parameter of the FileToGoogleCloudStorageOperator operator has been discontinued; you should use gcp_conn_id instead.
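If you are already on Airflow 2.0+ with the Google provider package installed, the same idea can be written against the providers API. A minimal sketch, assuming a provider version whose GCSHook.upload accepts in-memory data (which removes the need for the temporary file):
from airflow.models import BaseOperator
from airflow.providers.google.cloud.hooks.gcs import GCSHook


class ContentToGCSOperator(BaseOperator):
    """Airflow 2 variant: uploads templated text content directly to GCS."""

    template_fields = ('content', 'dst', 'bucket')

    def __init__(self, *, content, dst, bucket,
                 gcp_conn_id='google_cloud_default',
                 mime_type='application/json', **kwargs):
        super().__init__(**kwargs)
        self.content = content
        self.dst = dst
        self.bucket = bucket
        self.gcp_conn_id = gcp_conn_id
        self.mime_type = mime_type

    def execute(self, context):
        hook = GCSHook(gcp_conn_id=self.gcp_conn_id)
        hook.upload(
            bucket_name=self.bucket,
            object_name=self.dst,
            data=self.content,      # upload the rendered XCom string as the object body
            mime_type=self.mime_type,
        )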
I am trying to append data to a BigQuery table using Python code, which requires dynamic schema handling.
Can anyone point me to a link or example that handles the above scenario?
Here is example code for loading a .csv file into BigQuery using the Python client library:
# from google.cloud import bigquery
# client = bigquery.Client()
# filename = '/path/to/file.csv'
# dataset_id = 'my_dataset'
# table_id = 'my_table'
dataset_ref = client.dataset(dataset_id)
table_ref = dataset_ref.table(table_id)
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.skip_leading_rows = 1
job_config.autodetect = True
with open(filename, "rb") as source_file:
    job = client.load_table_from_file(source_file, table_ref, job_config=job_config)
job.result() # Waits for table load to complete.
print("Loaded {} rows into {}:{}.".format(job.output_rows, dataset_id, table_id))
Also check this part of the documentation to know more about appending data into tables from a source file using the same or different schema.
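For the dynamic-schema part, a minimal sketch of appending a CSV to an existing table while allowing new columns to appear (the project, dataset, table, and file path below are placeholders):
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # placeholder destination table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,  # append to the existing table
    schema_update_options=[
        bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,  # allow the load to add new columns
    ],
)

with open("/path/to/file.csv", "rb") as source_file:
    job = client.load_table_from_file(source_file, table_id, job_config=job_config)

job.result()  # waits for the load to complete
print("Loaded {} rows into {}.".format(job.output_rows, table_id))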
I searched in the boto3 doc but didn't find relevant information there. In this link, it is mentioned that it can be done using
k.storage_class='STANDARD_IA'
Can someone share a full code snippet here? Many thanks.
New file
import boto3

client = boto3.client('s3')
client.upload_file(
    Filename='/tmp/foo.txt',
    Bucket='my-bucket',
    Key='foo.txt',
    ExtraArgs={
        'StorageClass': 'STANDARD_IA'
    }
)
Existing file
From How to change storage class of existing key via boto3:
import boto3

s3 = boto3.client('s3')

copy_source = {
    'Bucket': 'mybucket',
    'Key': 'mykey'
}

s3.copy(
    CopySource=copy_source,
    Bucket='target-bucket',
    Key='target-key',
    ExtraArgs={
        'StorageClass': 'STANDARD_IA',
        'MetadataDirective': 'COPY'
    }
)
From the boto3 Storing Data example, it looks like the standard way to put objects in boto3 is
s3.Object('mybucket', 'hello.txt').put(Body=open('/tmp/hello.txt', 'rb'))
But to set the storage class, S3.Object.put suggests we'd want to use the parameter:
StorageClass='STANDARD_IA'
So combining the two, we have:
import boto3
s3 = boto3.resource('s3')
s3.Object('mybucket', 'hello.txt').put(Body=open('/tmp/hello.txt', 'rb'), StorageClass='STANDARD_IA')
Hope that helps