Data streaming from Raspberry Pi CSV files to a BigQuery table - google-cloud-platform

I have some CSV files generated by a Raspberry Pi that need to be pushed into BigQuery tables.
Currently we have a Python script that uses bigquery.LoadJobConfig for batch uploads, and I run it manually. The goal is to have streaming data (or a load every 15 minutes) in a simple way.
I explored different solutions:
Using Airflow to run the Python script (high complexity and maintenance)
Dataflow (I am not familiar with it but if it does the job I will use it)
Scheduling pipeline to run the script through GitLab CI (cron syntax: */15 * * * * )
Could you please suggest the best way to push CSV files into BigQuery tables in real time or every 15 minutes?

Good news, you have many options! Perhaps the easiest would be to automate the Python script that you have currently, since it does what you need. Assuming you are running it manually on a local machine, you could upload it to a lightweight VM on Google Cloud, then use cron on the VM to automate running it. I have used this approach in the past and it worked well.
Another option would be to deploy your Python code to a Google Cloud Function, a way to let GCP run the code without you having to worry about maintaining the backend resources.
Find out more about Cloud Functions here: https://cloud.google.com/functions
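If you go the Cloud Functions route, a minimal sketch of such a function might look like the one below. It assumes the Raspberry Pi (or another process) uploads each CSV to a GCS bucket and that the function is deployed with a GCS "finalize" trigger; the dataset and table names are placeholders, and the load call mirrors the LoadJobConfig approach you already use.

# Hypothetical Cloud Function (GCS "finalize" trigger) that loads an uploaded CSV into BigQuery.
from google.cloud import bigquery

TABLE_ID = "my-project.my_dataset.sensor_readings"  # placeholder table

def gcs_csv_to_bq(event, context):
    """Triggered whenever a file is finalized in the bucket."""
    uri = f"gs://{event['bucket']}/{event['name']}"

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,   # skip the CSV header row
        autodetect=True,       # or supply an explicit schema
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    load_job = client.load_table_from_uri(uri, TABLE_ID, job_config=job_config)
    load_job.result()  # wait for the load job to finish
    print(f"Loaded {uri} into {TABLE_ID}")

The same load logic can also run unchanged on a small VM under a cron entry such as */15 * * * * if you prefer the first option.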
A third option: depending on where your .csv files are being generated, you might be able to use the BigQuery Data Transfer Service to handle the imports into BigQuery.
More on that here: https://cloud.google.com/bigquery/docs/dts-introduction
Good luck!

Adding to Ben's answer, you can also use Cloud Composer to orchestrate this workflow. It is built on Apache Airflow, and you can use Airflow-native tools such as the Airflow web interface, command-line tools, and the Airflow scheduler without worrying about infrastructure and maintenance.
You can implement a DAG to:
upload the CSV from local storage to GCS, then
load from GCS to BigQuery using GCSToBigQueryOperator (a minimal DAG sketch is shown below)
More on Cloud Composer
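A minimal DAG sketch for that two-step flow might look like the following. The operators come from the Google provider package; the bucket, file path, dataset, and table names are all placeholders, and the schedule and start date are only illustrative.

# Hypothetical Composer/Airflow DAG: push a local CSV to GCS, then load it into BigQuery.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="csv_to_bigquery",
    schedule_interval="*/15 * * * *",   # every 15 minutes
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    upload_csv = LocalFilesystemToGCSOperator(
        task_id="upload_csv_to_gcs",
        src="/data/readings.csv",        # placeholder local path
        dst="incoming/readings.csv",     # placeholder object name
        bucket="my-bucket",              # placeholder bucket
    )

    load_to_bq = GCSToBigQueryOperator(
        task_id="gcs_to_bq",
        bucket="my-bucket",
        source_objects=["incoming/readings.csv"],
        destination_project_dataset_table="my_dataset.sensor_readings",
        source_format="CSV",
        skip_leading_rows=1,
        write_disposition="WRITE_APPEND",
        autodetect=True,
    )

    upload_csv >> load_to_bq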

Best way to export a table from BigQuery to GCS

I have some questions related to Cloud Composer and BigQuery. We need to implement an automated process to export tables from BigQuery to Cloud Storage.
I have 4 options at the moment:
bigquery_to_gcs Operator
BashOperator: Executing the "bq" command provided by the Cloud SDK on Cloud Composer.
Python Function: Create a Python function using the BigQuery API, almost the same as bigquery_to_gcs and execute this function with Airflow.
Dataflow: the job would be executed from Airflow as well.
I have some concerns about the first three options, though. If the table is huge, could the export consume a large share of Cloud Composer's resources? I have been trying to find out whether the BashOperator and the BigQuery operator consume Composer resources, keeping in mind that this process will run in production in the future with more DAGs running at the same time. If that is the case, would Dataflow be the more convenient option?
One advantage of Dataflow is that we can export the table to a single file if we want, which is not possible with the other options when the table is larger than 1 GB.
BigQuery itself has a feature to export data to GCS. This means that if you use any of the things you mentioned (except for the Dataflow job), you will simply trigger an export job that will be performed and managed by BigQuery.
This means that you do not need to worry about the consumption of cluster resources in Composer. The bigquery_to_gcs operator is simply the controller instructing BigQuery to run an export.
So, from the options you mention: bigquery_to_gcs operator, BashOperator, and Python function will incur a similar low cost. Just use whichever you find easier to manage.
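For reference, a minimal sketch of the operator-based approach might look like the following; the table and bucket names are placeholders. The operator only submits an extract job that BigQuery runs on its own side.

# Hypothetical Airflow task: ask BigQuery to export a table to GCS.
# The export job itself runs inside BigQuery, not on the Composer workers.
from airflow.providers.google.cloud.transfers.bigquery_to_gcs import BigQueryToGCSOperator

export_table = BigQueryToGCSOperator(
    task_id="export_table_to_gcs",
    source_project_dataset_table="my-project.my_dataset.my_table",  # placeholder
    # The wildcard lets BigQuery shard tables larger than 1 GB across multiple files.
    destination_cloud_storage_uris=["gs://my-bucket/exports/my_table-*.csv.gz"],
    export_format="CSV",
    compression="GZIP",
)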

How to schedule a query (export data) from Google BigQuery to external storage (e.g. Box)

I have read many articles and solutions about scheduling queries to external storage from Google BigQuery, but they did not seem very clear.
Note: my company has a subscription only to Google BigQuery and not to the complete set of cloud services (Google Cloud Platform).
I know how to do it manually but I am looking to automate the process since I need the same data every week.
Any suggestions will be appreciated. Thank you.
Option 1
You can use Apache Airflow, which lets you create scheduled tasks on top of BigQuery using the BigQuery operator.
You can find the basic steps required to set this up in this link.
Option 2
You can use the Google BigQuery command-line tool to export your data just as you do from the web UI, for example:
bq --location=[LOCATION] extract --destination_format [FORMAT] --compression [COMPRESSION_TYPE] --field_delimiter [DELIMITER] --print_header [BOOLEAN] [PROJECT_ID]:[DATASET].[TABLE] gs://[BUCKET]/[FILENAME]
Once you get this working, you can use any scheduling process of your liking to run this job.
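If you would rather schedule a Python script than the CLI, a rough equivalent of that command with the google-cloud-bigquery client library might look like this (project, dataset, table, and bucket names are placeholders):

# Hypothetical script version of the "bq extract" command above.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

job_config = bigquery.ExtractJobConfig(
    destination_format="CSV",
    compression="GZIP",
    field_delimiter=",",
    print_header=True,
)

extract_job = client.extract_table(
    "my-project.my_dataset.my_table",            # placeholder table
    "gs://my-bucket/exports/my_table-*.csv.gz",  # placeholder destination
    location="US",                               # must match the dataset location
    job_config=job_config,
)
extract_job.result()  # wait for the export to finish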
BTW: Airflow has a connector which enables you to run the command line tool
Once the file is in GCS, you can use the Box G Suite integration to see and manage your files.

Faster development turnaround time with AWS Glue

AWS Glue looks promising but I'm having a challenge with the development cycle time. If I edit PySpark scripts through the AWS console, it takes several minutes to run even on a minimal test dataset. This makes it a challenge to iterate quickly if I have to wait 3-5 minutes just to see whether I called the right method on glueContext or understood a particular DynamicFrame behavior.
What techniques would allow me to iterate faster?
I suppose I could develop Spark code locally, and deploy it to Glue as an execution framework. But if I need to test code with Glue-specific extensions, I am stuck.
For developing and testing scripts, Glue has Development Endpoints, which you can use with notebooks like Zeppelin installed either on a local machine or on an Amazon EC2 instance (other options are a REPL shell and PyCharm Professional).
Please don't forget to remove the endpoint when you are done with testing, since you pay for it even when it is idle.
I keep the PySpark code in one file and the Glue code in another; we use Glue only for reading and writing data. We do test-driven development with pytest on a local machine, so there is no need for a dev endpoint or Zeppelin. Once all syntax and business-logic bugs are fixed in the PySpark code, end-to-end testing is done with Glue. We also wrote a shell script that uploads the latest code to the S3 bucket from which the Glue job runs (a sketch of this split follows the links below).
https://github.com/fatangare/aws-glue-deploy-utility
https://github.com/fatangare/aws-python-shell-deploy
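A minimal sketch of that split might look like the following. The column names and business logic are made up; the point is simply that the transform works on plain Spark DataFrames, so it can be unit-tested locally with pytest before it is wired into a Glue job.

# transforms.py -- pure PySpark logic, no Glue imports (Glue reads/writes live in a separate job script).
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

def clean_orders(df: DataFrame) -> DataFrame:
    """Hypothetical business logic: drop cancelled orders and add a total column."""
    return (
        df.filter(F.col("status") != "CANCELLED")
          .withColumn("total", F.col("quantity") * F.col("unit_price"))
    )

# test_transforms.py -- run locally with pytest; no dev endpoint or Zeppelin required.
def test_clean_orders():
    spark = SparkSession.builder.master("local[1]").appName("tests").getOrCreate()
    df = spark.createDataFrame(
        [("OK", 2, 3.0), ("CANCELLED", 1, 5.0)],
        ["status", "quantity", "unit_price"],
    )
    result = clean_orders(df).collect()
    assert len(result) == 1
    assert result[0]["total"] == 6.0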

Google Dataprep: Scheduling with updated data source

Is there a way to trigger a Dataprep flow on a GCS (Google Cloud Storage) file upload? Or, at least, is it possible to make Dataprep run each day and take the newest file from a certain directory in GCS?
It should be possible, because otherwise what is the point of scheduling? Running the same job over the same data source with the same output?
It seems this product is very immature at the moment, so no API endpoint exists to run a job in this service. It is only possible to run a job in the UI.
In general, this is a pattern that is typically used for running jobs on a schedule. Maybe at some point the service will allow you to publish into the "queue" that Run Job already uses.

Google App Engine: Parse Logs in Datastore and Save to Table

I am new to GAE and I am trying to quickly find a way to retrieve logs from Datastore, clean them to my specs, and then save them to a table to be used later by a reports view in my app. I was thinking of using Google Dataflow and creating batch jobs (the app is Python/Django), but the documentation does not seem to fit my use case, so maybe Dataflow is not the answer. I could create a Python script with BigQuery and schedule it through cron, but then I would have to handle errors myself, and it seems there should be a faster way to solve this problem.
Any help/thoughts/suggestions is always greatly appreciated.
You can use the Dataflow/Beam Python SDK to develop a pipeline that reads entities from Datastore [1], transforms the data, and writes a table to BigQuery [2]. To schedule this job to run regularly, you will have to use a third-party mechanism such as a cron job. Note that Dataflow performs automatic scaling and retries to handle errors, so you are not expected to address these complexities manually. A minimal pipeline sketch follows after the references.
[1] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/datastore/v1/datastoreio.py
[2] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py
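A rough sketch of such a pipeline is shown below. The Datastore kind, the entity properties, and the BigQuery table are placeholders, and the exact Datastore connector module path can vary between Beam SDK versions.

# Hypothetical Beam pipeline: read log entities from Datastore, reshape them,
# and write them to a BigQuery table for the reports view.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.datastore.v1new.datastoreio import ReadFromDatastore
from apache_beam.io.gcp.datastore.v1new.types import Query

PROJECT = "my-project"                      # placeholder project
TABLE = "my-project:reports.cleaned_logs"   # placeholder table


def entity_to_row(entity):
    """Flatten a Datastore entity into a dict matching the BigQuery schema."""
    props = entity.properties
    return {
        "timestamp": str(props.get("timestamp")),
        "level": props.get("level"),
        "message": props.get("message"),
    }


def run():
    options = PipelineOptions(
        project=PROJECT,
        runner="DataflowRunner",
        temp_location="gs://my-bucket/tmp",  # placeholder bucket
    )
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadLogs" >> ReadFromDatastore(Query(kind="LogEntry", project=PROJECT))
            | "CleanLogs" >> beam.Map(entity_to_row)
            | "WriteToBQ" >> beam.io.WriteToBigQuery(
                TABLE,
                schema="timestamp:STRING,level:STRING,message:STRING",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )


if __name__ == "__main__":
    run()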