I have ~50k gzip-compressed JSON files per day that need to be loaded into BigQuery with some transformation; no API calls are involved. Files may be up to 1 GB each.
What is the most cost-efficient way to do this?
I'd appreciate any help.
The most efficient way is to use Cloud Data Fusion.
I would suggest the approach below:
Create a Cloud Function that is triggered on every new file upload and uncompresses the file.
Create a Data Fusion job with the GCS file as source and BigQuery as sink, with the desired transformation.
See my YouTube video below:
https://youtu.be/89of33RcaRw
Here is (for example) one way - https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json...
... but a quick look shows that it has some specific limitations. So perhaps the simplicity, customizability, and maintainability of the solution should also be part of your “cost” function.
Without knowing some details (for example, see the "Limitations" section under the link above; which stack you have or are willing and able to use; your file names; whether your files have nested fields; and so on), my first thought is a Cloud Function ( https://cloud.google.com/functions/pricing ) that "listens" (event type = Finalize/Create) to the Cloud Storage bucket where your files land. If you go this route, put your storage and function in the same region if possible, which will make it cheaper.
If you can code in Python, here is some starter code:
main.py
import io

import pandas as pd
import pandas_gbq
from google.cloud import storage


def process(event, context):
    """Triggered by a Finalize/Create event on the bucket."""
    working_file = event['name']
    # check that it is your file; you can also check for patterns in the name
    if working_file != 'YOUR_FILENAME':
        return
    try:
        storage_client = storage.Client()
        bucket = storage_client.get_bucket('your_bucket_here')
        blob = bucket.blob(working_file)
        # https://stackoverflow.com/questions/49541026/how-do-i-unzip-a-zip-file-in-google-cloud-storage
        zipbytes = io.BytesIO(blob.download_as_bytes())
        # print for logging
        print(f"file downloaded, {working_file}")
        # read file as df --- docs: https://pandas.pydata.org/docs/reference/api/pandas.read_json.html
        # if nested, you might need to go text --> dictionary and then do some preprocessing
        # (pass lines=True if the file is newline-delimited JSON)
        df = pd.read_json(zipbytes, compression='gzip')
        # write the processed data to BigQuery
        pandas_gbq.to_gbq(df,
                          destination_table='your_dataset.your_table',
                          project_id='your_project_id',
                          if_exists='append')
        print(f"table bq created, {working_file}")
        # if you want to delete the processed file from your storage
        # to save on storage costs, uncomment the 2 lines below
        # blob.delete()
        # print(f"blob delete, {working_file}")
    except Exception as e:
        print(f"exception occurred {e}, {working_file}")
requirements.txt
# Function dependencies, for example:
# package>=version
google-cloud-storage
google-cloud-bigquery
pandas
pandas-gbq
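Since files can be up to 1 GB compressed, loading a whole file into a DataFrame may not fit in a function's memory. The following is only a minimal sketch of a streaming alternative, assuming the files are newline-delimited JSON (one object per line); `transform_gzip_ndjson` and `add_source` are hypothetical names, and the transformation itself is a placeholder. The input can be any binary file-like object, for example the one returned by `blob.open("rb")` in google-cloud-storage, if your library version provides it.

```python
import gzip
import io
import json
from typing import BinaryIO, Callable, Iterator


def transform_gzip_ndjson(
    raw: BinaryIO,
    transform: Callable[[dict], dict],
) -> Iterator[str]:
    """Yield transformed records as JSON lines, one input line at a time."""
    # gzip.open accepts an existing file object and decompresses lazily,
    # so only one line needs to be held in memory at once.
    with gzip.open(raw, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            yield json.dumps(transform(json.loads(line)))


def add_source(record: dict) -> dict:
    """Hypothetical transformation: tag each record with its origin."""
    return {**record, "source": "daily_upload"}


# Small in-memory demonstration (no cloud access needed).
payload = gzip.compress(b'{"id": 1}\n\n{"id": 2}\n')
rows = list(transform_gzip_ndjson(io.BytesIO(payload), add_source))
print(rows)
```

The resulting lines could then be written back to Cloud Storage or handed to a BigQuery load job in batches; which of those is cheaper depends on your setup.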
PS
Some alternatives include:
1. Starting up a VM, running your script on a schedule, and shutting the VM down once the process is done (for example Cloud Scheduler --> Pub/Sub --> Cloud Function --> which starts up your VM --> which then runs your script)
2. Using App Engine to run your script (similar)
3. Using Cloud Run to run your script (similar)
4. Using Composer/Airflow (not similar to 1, 2 & 3) [could use all types of approaches, including data transfers etc.; I'm just not sure what stack you are trying to use or what you already have running]
5. Scheduling a Vertex AI Workbench notebook (not similar to 1, 2 & 3; basically write up a Jupyter notebook and schedule it to run in Vertex AI)
6. Trying to query the files directly (https://cloud.google.com/bigquery/external-data-cloud-storage#bq_1) and scheduling that query (https://cloud.google.com/bigquery/docs/scheduling-queries) to run (but again, I'm not sure about your overall pipeline)
The setup for all of these (except #5 & #6) adds technical debt that, to me, is not worth it if you can get away with functions.
Best of luck,
I'm stuck. I have to import files and directories from a remote server (probably Linux) to GCS buckets on a given schedule (let's say weekly).
There are several folders on the server, one for each project. Inside each project folder there are other folders named after dates (20221015, 20221019, ...). Inside each of these folders there are other subfolders.
I have to migrate an entire folder, starting from the date folder, every week.
First question: which Google service is better for my purpose, Storage Transfer Service or Cloud Composer (is there an Operator for this task)?
Second question: how can I avoid importing date folders that I have already imported (e.g. one week ago)?
Thank you for your help.
Using Cloud Composer, you may use SFTPToGCSOperator.
Here is a working DAG on my end:
import os

from airflow import models
from airflow.models import Variable
from airflow.providers.google.cloud.transfers.sftp_to_gcs import SFTPToGCSOperator
from airflow.utils.dates import days_ago

default_args = {"start_date": days_ago(1)}

DIR_path = "<your-path>"
BUCKET_SRC = "<your-bucket>"

with models.DAG(
    "dag_sftp_to_gcs", default_args=default_args, schedule_interval=None
) as dag:
    copy_sftp_to_gcs = SFTPToGCSOperator(
        task_id="t_sftp_to_gcs",
        sftp_conn_id="<your-connection>",
        gcp_conn_id="google_cloud_default",
        source_path=os.path.join(DIR_path, "*.xlsx"),  # you may use a wildcard to get multiple files
        destination_bucket=BUCKET_SRC,
    )

    copy_sftp_to_gcs
Remember though that you need to allow ssh in your firewall settings. You may also follow this SO post on how to create connections in Airflow.
[Screenshots omitted: the file in my VM, the status in the Airflow console, and the transferred file in GCS.]
For your second question, you could create a shell script on your server that moves or deletes files, depending on whether you want them retained or removed, so that only the correct files are fetched by Airflow.
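Another way to skip already-imported date folders is to remember the last imported date and select only newer folders. This is a minimal sketch, assuming folder names follow the YYYYMMDD pattern from the question; `folders_to_import` and `parse_date_folder` are hypothetical helpers, and where the last imported date is stored (for example an Airflow Variable) is left open.

```python
from datetime import date, datetime
from typing import Iterable, List, Optional


def parse_date_folder(name: str) -> Optional[date]:
    """Return the date encoded in a YYYYMMDD folder name, or None."""
    if len(name) != 8 or not name.isdigit():
        return None
    try:
        return datetime.strptime(name, "%Y%m%d").date()
    except ValueError:
        return None  # eight digits, but not a valid calendar date


def folders_to_import(names: Iterable[str], last_imported: Optional[str]) -> List[str]:
    """Select date folders strictly newer than the last imported one."""
    cutoff = parse_date_folder(last_imported) if last_imported else None
    selected = []
    for name in names:
        folder_date = parse_date_folder(name)
        if folder_date is None:
            continue  # ignore folders that are not date-named
        if cutoff is None or folder_date > cutoff:
            selected.append(name)
    return sorted(selected)


pending = folders_to_import(["20221015", "20221019", "logs", "20221026"], "20221019")
print(pending)
```

After a successful transfer, the newest name in the returned list would become the new "last imported" value.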
I have successfully scheduled my query in BigQuery, and the result is saved as a table in my dataset. I see a lot of information about scheduling data transfer in to BigQuery or Cloud Storage, but I haven't found anything regarding scheduling an export from a BigQuery table to Cloud Storage yet.
Is it possible to schedule an export of a BigQuery table to Cloud Storage so that I can further schedule having it SFTP-ed to me via Google BigQuery Data Transfer Services?
There isn't a managed service for scheduling BigQuery table exports, but one viable approach is to use Cloud Functions in conjunction with Cloud Scheduler.
The Cloud Function would contain the necessary code to export to Cloud Storage from the BigQuery table. There are multiple programming languages to choose from for that, such as Python, Node.JS, and Go.
Cloud Scheduler would send an HTTP call periodically in a cron format to the Cloud Function which would in turn, get triggered and run the export programmatically.
As an example and more specifically, you can follow these steps:
Create a Cloud Function using Python with an HTTP trigger. To interact with BigQuery from within the code you need to use the BigQuery client library. Import it with from google.cloud import bigquery. Then, you can use the following code in main.py to create an export job from BigQuery to Cloud Storage:
# Imports the BigQuery client library
from google.cloud import bigquery


def hello_world(request):
    # Replace these values according to your project
    project_name = "YOUR_PROJECT_ID"
    bucket_name = "YOUR_BUCKET"
    dataset_name = "YOUR_DATASET"
    table_name = "YOUR_TABLE"
    destination_uri = "gs://{}/{}".format(bucket_name, "bq_export.csv.gz")

    bq_client = bigquery.Client(project=project_name)
    # Build the table reference (Client.dataset() is deprecated)
    dataset = bigquery.DatasetReference(project_name, dataset_name)
    table_to_export = dataset.table(table_name)

    # Compress the exported file with gzip
    job_config = bigquery.job.ExtractJobConfig()
    job_config.compression = bigquery.Compression.GZIP

    extract_job = bq_client.extract_table(
        table_to_export,
        destination_uri,
        # Location must match that of the source table.
        location="US",
        job_config=job_config,
    )
    return "Job with ID {} started exporting data from {}.{} to {}".format(
        extract_job.job_id, dataset_name, table_name, destination_uri
    )
Specify the client library dependency in the requirements.txt file by adding this line:
google-cloud-bigquery
Create a Cloud Scheduler job. Set the Frequency with which you wish the job to be executed. For instance, setting it to 0 1 * * 0 would run the job once a week at 1 AM every Sunday morning. The crontab tool is pretty useful when it comes to experimenting with cron scheduling.
Choose HTTP as the Target, set the URL as the Cloud Function's URL (it can be found by selecting the Cloud Function and navigating to the Trigger tab), and as HTTP method choose GET.
Once created, you can test how the export behaves by pressing the RUN NOW button. However, before doing so, make sure the default App Engine service account has at least the Cloud IAM roles/storage.objectCreator role; otherwise the operation might fail with a permission error. The default App Engine service account has the form YOUR_PROJECT_ID@appspot.gserviceaccount.com.
If you wish to execute exports on different tables, datasets and buckets for each execution, while essentially employing the same Cloud Function, you can use the HTTP POST method instead and configure a Body containing said parameters as data, which would be passed on to the Cloud Function, although that would imply making some small changes to its code.
Lastly, when the job is created, you can use the Cloud Function's returned job ID and the bq CLI to view the status of the export job with bq show -j <job_id>.
Not sure if this was in GA when this question was asked, but at least now there is an option to run an export to Cloud Storage via a regular SQL query. See the SQL tab in Exporting table data.
Example:
EXPORT DATA
  OPTIONS (
    uri = 'gs://bucket/folder/*.csv',
    format = 'CSV',
    overwrite = true,
    header = true,
    field_delimiter = ';')
AS (
  SELECT field1, field2
  FROM mydataset.table1
  ORDER BY field1
);
This could just as easily be set up via a Scheduled Query if you need a periodic export. And, of course, you need to make sure the user or service account running this has permissions to read the source datasets and tables and to write to the destination bucket.
Hopefully this is useful for other peeps visiting this question if not for OP :)
There is an alternative to the second part of Maxim's answer. The code for extracting the table and storing it in Cloud Storage should work.
However, when you schedule a query, you can also define a Pub/Sub topic to which the BigQuery scheduler will post a message when the job is over. The Cloud Scheduler setup described by Maxim is therefore optional, and you can simply plug the function into the Pub/Sub notification.
Before performing the extraction, don't forget to check the error status in the Pub/Sub notification. The notification also carries a lot of information about the scheduled query, which is useful if you want to perform more checks or generalize the function.
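The error-status check could look roughly like the sketch below. It assumes the Pub/Sub-triggered function receives an event whose `data` field is the base64-encoded JSON of the scheduled-query run, and that this JSON carries `state` and `errorStatus` fields; both field names are assumptions to verify against an actual notification. `scheduled_query_succeeded` and `_fake_event` are hypothetical helpers.

```python
import base64
import json


def scheduled_query_succeeded(event: dict) -> bool:
    """Return True only if the notification reports a successful run."""
    raw = event.get("data")
    if not raw:
        return False  # nothing to inspect, so do not export
    run = json.loads(base64.b64decode(raw).decode("utf-8"))
    # Assumed payload fields: "state" and "errorStatus" (empty when no error).
    has_error = bool(run.get("errorStatus"))
    return run.get("state") == "SUCCEEDED" and not has_error


def _fake_event(payload: dict) -> dict:
    """Build a Pub/Sub-style event for local testing."""
    encoded = base64.b64encode(json.dumps(payload).encode("utf-8"))
    return {"data": encoded.decode("ascii")}


ok_event = _fake_event({"state": "SUCCEEDED", "errorStatus": {}})
failed_event = _fake_event(
    {"state": "FAILED", "errorStatus": {"code": 3, "message": "bad query"}}
)
print(scheduled_query_succeeded(ok_event), scheduled_query_succeeded(failed_event))
```

The export code would then run only when this check returns True.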
One more point, about the SFTP transfer. I open-sourced a project that queries BigQuery, builds a CSV file, and transfers this file to an FTP server (SFTP and FTPS aren't supported, because my previous company only used the FTP protocol!). If your file is smaller than 1.5 GB, I can update my project to add SFTP support if you want to use it. Let me know.
I need to automate a process that extracts data from Google BigQuery and exports it to a CSV file on an external server outside GCP.
While researching how to do that, I found some commands to run from my external server, but I would prefer to do everything in GCP to avoid possible problems.
To export the table to CSV in Google Cloud Storage:
bq --location=US extract --compression GZIP 'dataset.table' gs://example-bucket/myfile.csv
To download the CSV from Google Cloud Storage:
gsutil cp gs://[BUCKET_NAME]/[OBJECT_NAME] [OBJECT_DESTINATION]
But I would like to hear your suggestions
If you want to fully automate this process, I would do the following:
Create a Cloud Function to handle the export:
This is the most lightweight solution, as Cloud Functions are serverless and provide the flexibility to implement code with the Client Libraries. See the quickstart; I recommend using the console to create the functions to start with.
In this example I recommend triggering the Cloud Function from an HTTP request, i.e. when the function URL is called, it will run the code inside it.
An example Cloud Function in Python that creates the export when an HTTP request is made:
main.py
from google.cloud import bigquery


def hello_world(request):
    project_name = "MY_PROJECT"
    bucket_name = "MY_BUCKET"
    dataset_name = "MY_DATASET"
    table_name = "MY_TABLE"
    destination_uri = "gs://{}/{}".format(bucket_name, "bq_export.csv.gz")

    bq_client = bigquery.Client(project=project_name)
    # Build the table reference (Client.dataset() is deprecated)
    dataset = bigquery.DatasetReference(project_name, dataset_name)
    table_to_export = dataset.table(table_name)

    # Compress the exported file with gzip
    job_config = bigquery.job.ExtractJobConfig()
    job_config.compression = bigquery.Compression.GZIP

    extract_job = bq_client.extract_table(
        table_to_export,
        destination_uri,
        # Location must match that of the source table.
        location="US",
        job_config=job_config,
    )
    return "Job with ID {} started exporting data from {}.{} to {}".format(
        extract_job.job_id, dataset_name, table_name, destination_uri
    )
requirements.txt
google-cloud-bigquery
Note that the job runs asynchronously in the background. The response contains the job ID, which you can use to check the state of the export job in Cloud Shell by running:
bq show -j <job_id>
Create a Cloud Scheduler scheduled job:
Follow this documentation to get started. You can set the Frequency with the standard cron format, for example 0 0 * * * will run the job every day at midnight.
As a target, choose HTTP, in the URL put the Cloud Function HTTP URL (you can find it in the console, inside the Cloud Function details, under the Trigger tab), and as HTTP method choose GET.
Create it, and you can test it in the Cloud Scheduler by pressing the Run now button in the Console.
Synchronize your external server and the bucket:
Up to now you have only scheduled exports to run every 24 hours. To synchronize the bucket contents with your external server, you can use the gsutil rsync command. If you want to save the exports to, say, the my_exports folder, you can run the following on your external server:
gsutil rsync gs://BUCKET_WITH_EXPORTS /local-path-to/my_exports
To run this command periodically, you could create a standard cron job in the crontab on your external server that also runs daily, a few hours after the BigQuery export, to ensure that the export has completed.
Extra:
I have hard-coded most of the variables in the Cloud Function to be always the same. However, you can send parameters to the function, if you do a POST request instead of a GET request, and send the parameters as data in the body.
You will have to change the Cloud Scheduler job to send a POST request to the Cloud Function HTTP URL, and in the same place you can set the body to send the parameters regarding the table, dataset and bucket, for example. This will allow you to run exports from different tables at different hours, and to different buckets.
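The parameterized variant described above might resolve its settings along these lines. This is only a sketch: the body keys `project`, `bucket`, `dataset` and `table` are hypothetical names chosen for illustration, and in the function itself the body would come from something like `request.get_json(silent=True)` on the Flask request object.

```python
from typing import Mapping, Optional

# Fallback values, mirroring the hard-coded ones in the function above.
DEFAULTS = {
    "project": "MY_PROJECT",
    "bucket": "MY_BUCKET",
    "dataset": "MY_DATASET",
    "table": "MY_TABLE",
}


def resolve_export_params(body: Optional[Mapping[str, str]]) -> dict:
    """Merge the POST body over the defaults and derive the destination URI."""
    params = dict(DEFAULTS)
    for key in DEFAULTS:
        value = (body or {}).get(key)
        if value:  # ignore missing or empty values
            params[key] = value
    params["destination_uri"] = "gs://{}/{}".format(
        params["bucket"], "bq_export.csv.gz"
    )
    return params


# Example: a Cloud Scheduler body overriding only the table and the bucket.
resolved = resolve_export_params({"table": "orders", "bucket": "exports"})
print(resolved)
```

A GET request, or a POST with no body, falls back to the defaults, so the existing scheduler job would keep working.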
I want to try image segmentation with deep learning using AWS. I have my data stored on Amazon S3 and I'd like to access it from a Jupyter Notebook which is running on an Amazon EC2 instance.
I'm planning on using Tensorflow for segmentation, therefore it seemed appropriate to me to use options provided by Tensorflow themselves (https://www.tensorflow.org/deploy/s3) as it feels that in the end I want my data to be represented in the format of tf.Dataset. However, it didn't quite work out for me. I've tried the following:
filenames = ["s3://path_to_first_image.png", "s3://path_to_second_image.png"]
dataset = tf.data.TFRecordDataset(filenames)
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
    for i in range(2):
        print(sess.run(next_element))
I get the following error:
OutOfRangeError: End of sequence
[[Node: IteratorGetNext_6 = IteratorGetNext[output_shapes=[[]], output_types=[DT_STRING], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator_6)]]
I'm quite new to tensorflow and have just recently started trying out some stuff with AWS, so I hope that my mistake is gonna be obvious to someone with more experience. I would greatly appreciate any help or advice! Maybe it's even the wrong way and I'm better off with something like boto3 (also stumbled upon it, but thought that tf would be more appropriate in my case) or something else?
P.S. Tensorflow also recommends to test a setup with the following piece:
from tensorflow.python.lib.io import file_io
print (file_io.stat('s3://path_to_image.png'))
For me this leads to an "Object doesn't exist" error, though the object certainly exists and is listed among the others if I use
for obj in s3.Bucket(name=MY_BUCKET_NAME).objects.all():
    print(os.path.join(obj.bucket_name, obj.key))
I also have my credentials filled in at ~/.aws/credentials. What might be the problem here?
Not a direct answer to your question but still something I noticed as to why you can't load data using Tensorflow.
The files in your filenames are .png and not in the .tfrecord file format which is a binary storage format. So, tf.data.TFRecordDataset(filenames) shouldn't work.
I think the following will work. Note: this is for TF2; I'm not sure whether it is the same for TF1. A similar example can be found on TensorFlow's website.
Step 1
Load your files into a TensorFlow dataset with tf.data.Dataset.list_files.
import tensorflow as tf
list_ds = tf.data.Dataset.list_files(filenames)
Step 2
Make a function that will be applied to each element in the dataset by using map; this will use the function on every element in the TF dataset.
def process_path(file_path):
    '''Reads the path and returns an image.'''
    # load the raw data from the file as a string
    byteString = tf.io.read_file(file_path)
    # convert the compressed string to a 3D uint8 tensor
    img = tf.image.decode_png(byteString, channels=3)
    return img

# apply process_path to every file path in the dataset
dataset = list_ds.map(process_path)
Step 3
Check out the image.
import matplotlib.pyplot as plt
for image in dataset.take(1):
    plt.imshow(image)
You can access S3 data directly from the Ubuntu Deep Learning instance by running:
cd ~/.aws
aws configure
Then enter the AWS access key and secret key for the instance, just to make sure. Check the awscli version using the command:
aws --version
Read more on configuration
https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html
Then, in Jupyter, you can run:
import pandas as pd
from smart_open import smart_open

aws_key = 'aws_key'
aws_secret = 'aws_secret'
bucket_name = 'my_bucket'
object_key = 'data.csv'
# credentials are embedded in the URL as key:secret@bucket
path = 's3://{}:{}@{}/{}'.format(aws_key, aws_secret, bucket_name, object_key)
df = pd.read_csv(smart_open(path))
Also, objects stored in the buckets have a unique key value and are retrieved using an HTTP URL address. For example, if an object with a key value /photos/mygarden.jpg is stored in the myawsbucket bucket, then it is addressable using the URL http://myawsbucket.s3.amazonaws.com/photos/mygarden.jpg.
If your data is not sensitive, you can use the http option. More details:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonS3.html
You can change the setting of the bucket to public. Hope this helps.
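The URL scheme described above can be sketched as a small helper. `s3_object_url` is a hypothetical function, and the resulting URL is only readable without credentials if the object is public.

```python
from urllib.parse import quote


def s3_object_url(bucket: str, key: str, scheme: str = "http") -> str:
    """Build the virtual-hosted-style URL for an object in a bucket."""
    # Drop a leading slash from the key and percent-encode unsafe characters;
    # quote() leaves "/" untouched by default.
    return "{}://{}.s3.amazonaws.com/{}".format(scheme, bucket, quote(key.lstrip("/")))


url = s3_object_url("myawsbucket", "/photos/mygarden.jpg")
print(url)
```

Such a URL can be passed to anything that reads over HTTP, for example `pd.read_csv` for a public CSV object.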
I have an apache-beam==2.3.0 pipeline written using the Python SDK that works with the DirectRunner locally. When I change the runner to DataflowRunner I get an error about 'storage' not being global.
Checking my code, I think it's because I am using the credentials stored in my environment. In my Python code I just do:
class ExtractBlobs(beam.DoFn):
    def process(self, element):
        from google.cloud import storage
        client = storage.Client()
        yield list(client.get_bucket(element).list_blobs(max_results=100))
The real issue is that I need the client so I can then get the bucket so I can then list the blobs. Everything I'm doing here is so I can list the blobs.
So I'd appreciate it if anyone could either point me in the right direction for using 'storage.Client()' in Dataflow or show me how to list the blobs of a GCP bucket without needing the client.
Thanks in advance!
[+] What I've read: https://cloud.google.com/docs/authentication/production#auth-cloud-implicit-python
Fixed:
Okay, so upon further reading and investigating, it turns out I have the required libraries to run my pipeline locally, but Dataflow needs to know about them in order to install them on the resources it spins up. https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
So all I've done is create a requirements.txt file with my google-cloud-* requirements.
I then spin up my pipeline like this:
python myPipeline.py --requirements_file requirements.txt --save_main_session True
That last flag tells it to keep the imports you do in main.