I want to schedule a data transfer job from Cloud Storage to BigQuery.
I have an application that continuously dumps data to a GCS bucket path (let's say gs://test-bucket/data1/*.avro), and I want to move that data to BigQuery as soon as each object is created in GCS.
I don't want to migrate all the files in the folder again and again; I only want to move the objects added since the last run.
The BigQuery Data Transfer Service accepts Avro files as input, but it does not take a folder, and it reloads everything rather than only the newly added objects.
I am new to this, so I might be missing some functionality. How can I achieve it?
Please note: I want to schedule a job to load the data at a certain frequency (every 10 or 15 minutes). I don't want a trigger-based solution, since the number of objects generated will be huge.
You can use a Cloud Function with a Cloud Storage event trigger: launch a Cloud Function that loads the data into BigQuery whenever a new file arrives.
https://cloud.google.com/functions/docs/calling/storage
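For illustration, here is a minimal sketch of such a function in Python (a background function triggered on object finalize); the destination dataset and table names are assumptions:

```python
# main.py - hedged sketch of a GCS-triggered Cloud Function that runs a
# BigQuery load job for each newly created Avro object.
from google.cloud import bigquery

BQ_TABLE = "my_project.my_dataset.data1"  # hypothetical destination table

def load_new_object(event, context):
    """Background function: 'event' describes the GCS object that was finalized."""
    uri = f"gs://{event['bucket']}/{event['name']}"
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.AVRO,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    # Load only the object named in the event, so already-loaded files
    # are never reprocessed.
    client.load_table_from_uri(uri, BQ_TABLE, job_config=job_config).result()
```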
EDIT: If you have more than 1,500 loads per day (the per-table load job limit), you can work around it by writing the data with the BigQuery Storage Write API instead.
If you do not need top performance, you can simply create an external table over that folder and query it instead of loading every file.
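If you go the external-table route, the definition can be as simple as this sketch using the Python client (project, dataset, and table names are made up):

```python
# Hedged sketch: define a BigQuery external table over the Avro folder so
# queries always see whatever objects currently match the wildcard.
from google.cloud import bigquery

client = bigquery.Client()
table = bigquery.Table("my_project.my_dataset.data1_external")  # hypothetical name
external_config = bigquery.ExternalConfig("AVRO")
external_config.source_uris = ["gs://test-bucket/data1/*.avro"]
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)
```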
I have a sample working where I put a file in S3.
What I'm confused about is what happens when I add new CSV files (with the same format) to that folder.
Are they instantly available in queries, or do you have to run Glue or something to process them? For example, what if I set up a Lambda function that extracts a new CSV to that same S3 directory every hour, or even every 5 minutes?
Does Athena actually load the data into a database somewhere in order to run fast queries?
If your table is not partitioned, or you add a file to an existing partition, the data will be available right away.
However, if you constantly add files you may want to consider partitioning your table to optimize query performance, see:
Table Location in Amazon S3
Partitioning Data
Athena itself doesn't do any caching; every query hits the S3 location of the table.
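If you do partition the table, newly added partition directories have to be registered before Athena sees them. A rough sketch with boto3 (the database, table, and results location are assumptions):

```python
# Rough sketch: register any new partition directories so Athena queries
# pick up freshly added files. Names below are placeholders.
import boto3

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE my_database.my_table",
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)
```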
I am trying to automate the entire data load: whenever I upload a file to Google Cloud Storage, it should automatically trigger the data to be loaded into the BigQuery dataset. I know there is a scheduled daily update available, but I want something that triggers only when the CSV file is re-uploaded.
You have 2 possibilities:
Either you react to events: you can attach a function to Google Cloud Storage events. The event message tells you which file was stored in GCS, and you can do whatever you want with it, for example run a load job from Google Cloud Storage.
Or, do nothing! Leave the file in GCS and create a BigQuery federated (external) table that reads it directly from GCS.
With these two solutions your data is accessible from BigQuery, and your Data Studio graph can query BigQuery, so the data is there. However:
The load job is more efficient: you can partition and cluster your data to optimize speed and cost (see the sketch below). However, you duplicate your data (it also stays in GCS), and you have to write and run your function. That said, the cost is very low and the function is very simple; for big data, this is my recommended solution.
The federated table is very useful when the quantity of data is low, and for occasional access or prototyping. You can't cluster or partition the data, and queries are slower than on data loaded into BigQuery (because the CSV parsing is performed on the fly).
Finally, big data is a wide area: do you need to transform the data before the load? Can you transform it after the load? How do you chain the queries one after the other? ....
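As a concrete example of the load-job option, partitioning and clustering can be set directly on the load job. This is a sketch only; the table, bucket, and column names are invented:

```python
# Sketch of a CSV load job with partitioning and clustering configured.
# Table, bucket, and column names are assumptions for illustration only.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    time_partitioning=bigquery.TimePartitioning(field="event_date"),
    clustering_fields=["customer_id"],
)
client.load_table_from_uri(
    "gs://my-bucket/exports/latest.csv",
    "my_dataset.events",
    job_config=job_config,
).result()
```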
Don't hesitate if you have other questions on this!
I have a large number of JSON files in Google Cloud Storage that I would like to load into BigQuery. The average file size is 5 MB uncompressed.
The problem is that they are not newline-delimited, so I can't load them as-is into BigQuery.
What's my best approach here? Should I use Cloud Functions or Dataprep, or just spin up a server to download each file, reformat it, upload it back to Cloud Storage, and then load it into BigQuery?
Do not compress the data before loading it into BigQuery. Also, 5 MB is small for BigQuery; I would look at consolidation strategies, and maybe change the file format while processing each JSON file.
You can use Dataprep, Dataflow, or even Dataproc. Depending on how many files you have, one of these may be the best choice. Anything larger than, say, 100,000 files of 5 MB each will require one of these big systems with many nodes.
Cloud Functions would take too long for anything more than a few thousand files.
Another option is to write a simple Python program that preprocesses your files on Cloud Storage and loads them directly into BigQuery. We are only talking about 20 or 30 lines of code, unless you add consolidation. A 5 MB file would take about 500 ms to load, process, and write back; I am not sure about the BigQuery load time. For 50,000 files of 5 MB each, expect 12 to 24 hours for one thread on a large Compute Engine instance (you need high network bandwidth).
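A hedged sketch of what that program could look like (bucket, blob, and table names are placeholders): it reads one JSON-array file, rewrites it as newline-delimited JSON next to the original, and loads the result into BigQuery.

```python
# Sketch only: convert a JSON-array file in GCS to newline-delimited JSON
# and load it into BigQuery. All names are placeholders.
import json
from google.cloud import bigquery, storage

storage_client = storage.Client()
bq_client = bigquery.Client()

def convert_and_load(bucket_name: str, blob_name: str, table_id: str) -> None:
    bucket = storage_client.bucket(bucket_name)
    # Original file: a single JSON array of records.
    records = json.loads(bucket.blob(blob_name).download_as_text())
    ndjson = "\n".join(json.dumps(record) for record in records)

    # Write the reformatted file back to Cloud Storage...
    out_name = blob_name + ".ndjson"
    bucket.blob(out_name).upload_from_string(ndjson, content_type="application/json")

    # ...and load it into BigQuery.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
    )
    bq_client.load_table_from_uri(
        f"gs://{bucket_name}/{out_name}", table_id, job_config=job_config
    ).result()

# Example: convert_and_load("my-bucket", "raw/file-0001.json", "my_dataset.events")
```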
Another option is to spin up multiple Compute Engine instances. One instance puts the names of N files (something like 4 or 16) per message into Pub/Sub; multiple other instances then subscribe to the same topic and process the files in parallel. Again, this is only another 100 lines of code.
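The fan-out step might look roughly like this (project, topic, and bucket names are made up); worker instances would subscribe to the same topic and run the conversion above on each batch of names they receive:

```python
# Sketch of the distributing side: list the files once and publish their
# names in small batches to Pub/Sub. All names are placeholders.
from google.cloud import pubsub_v1, storage

BATCH_SIZE = 16  # N file names per message

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "json-files-to-convert")

names = [blob.name for blob in storage.Client().list_blobs("my-bucket", prefix="raw/")]
for i in range(0, len(names), BATCH_SIZE):
    message = ",".join(names[i:i + BATCH_SIZE]).encode("utf-8")
    publisher.publish(topic_path, message)  # workers split the payload on ","
```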
If your project consists of many millions of files, network bandwidth and compute time will be an issue unless time is not a factor.
You can use Dataflow to do this.
Choose the “Text Files on Cloud Storage to BigQuery” template:
A pipeline that can read text files stored in GCS, perform a transform via a user-defined JavaScript function, and load the results into BigQuery. This pipeline requires a JavaScript function and a JSON file describing the resulting BigQuery schema.
When creating the job, you will need to add a UDF in JavaScript that converts the JSON to newline-delimited JSON.
This will retrieve the files from GCS, convert them and upload them to BigQuery automatically.
I'm new to AWS and come from a data warehousing ETL background. We are currently moving to the cloud and building a data lake on AWS: we load data from our external source RDBMS into an Amazon S3 landing layer (bucket) using Sqoop jobs, and then into different layers (buckets) in Amazon S3 using Informatica BDM.
We receive data from the external source system daily. I'm not sure how we should implement delta loads/SCD types in S3. Is it possible to change an object after creating it in an Amazon S3 bucket, or do we have to keep storing each day's load as a new object in the bucket?
I understand Amazon gives us database options, but we have been directed to load the data into Amazon S3.
Amazon S3 is simply a storage system. It will store whatever data is provided.
It is not possible to 'update' an object in Amazon S3. An object can be overwritten (replaced), but it cannot be appended.
Traditionally, information in a data lake is appended by adding additional files, such as a daily dump of information. Systems that consume data from the data lake normally process multiple files. In fact, this is a more efficient approach, since the data can be processed in parallel rather than attempting to read a single, large file.
So, your system can either do a new, complete dump that replaces data or it can store additional files with the incremental data.
Another common practice is to partition data, which puts files into different directories such as a different directory per month or day or hour. This way, when a system processes data in the data lake, it only needs to read files in the directories that are known to contain data for a given time period. For example, if a query wishes to processes data for a given month, it only needs to read the directory with data for that month, thereby speeding the process. (Partitions can also be hierarchical, such as having directories for hour inside day inside month.)
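For example, a daily incremental dump can simply be written under a date-based prefix; the bucket and key names below are invented for illustration:

```python
# Illustrative only: upload today's incremental extract under a
# year/month/day partitioned prefix. Bucket and file names are made up.
import datetime
import boto3

s3 = boto3.client("s3")
today = datetime.date.today()
key = (
    f"datalake/sales/year={today.year}/"
    f"month={today.month:02d}/day={today.day:02d}/load.csv"
)
s3.upload_file("daily_extract.csv", "my-datalake-bucket", key)
```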
To answer your question of "how do we have to implement Delta load/SCD Types in S3", it really depends on how you will use the data once it is in the data lake. It would be good to store the data in a manner that helps the system that will eventually consume it.
For a project we've inherited, we have a large-ish set of legacy data (600 GB) that we would like to archive but still have available if need be.
We're looking at using the AWS data pipeline to move the data from the database to be in S3, according to this tutorial.
https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-copyactivity.html
However, we would also like to be able to retrieve a 'row' of that data if we find the application is actually using a particular row.
Apparently that tutorial puts all of the data from a table into a single massive CSV file.
Is it possible to split the data up into separate files, with 100 rows of data in each file, and give each file a predictable file name, such as:
foo_data_10200_to_10299.csv
So that if we realise we need to retrieve row 10239, we can know which file to retrieve, and download just that, rather than all 600GB of the data.
If your data is stored in CSV format in Amazon S3, there are a couple of ways to easily retrieve selected data:
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.
S3 Select (currently in preview) enables applications to retrieve only a subset of data from an object by using simple SQL expressions.
These work on compressed (gzip) files too, to save storage space.
See:
Welcome - Amazon Athena
S3 Select and Glacier Select – Retrieving Subsets of Objects
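For example, with S3 Select you can ask for just the rows you need from a single object; the bucket, key, and column position below are assumptions:

```python
# Hedged example of S3 Select: pull only matching rows from one gzipped CSV
# object instead of downloading the whole file. Names are placeholders.
import boto3

s3 = boto3.client("s3")
response = s3.select_object_content(
    Bucket="my-archive-bucket",
    Key="exports/foo_data_10200_to_10299.csv.gz",
    ExpressionType="SQL",
    Expression="SELECT * FROM s3object s WHERE s._1 = '10239'",  # _1 = first column
    InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}, "CompressionType": "GZIP"},
    OutputSerialization={"CSV": {}},
)
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")
```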