We currently load most of our data into BigQuery either via CSV or directly via the streaming API. However, I was wondering whether there are any benchmarks available (or maybe a Google engineer could just tell me in the answer) showing how loading different formats compares in efficiency.
For example, if we have the same 100M rows of data, does BigQuery show any performance difference from loading it in:
parquet
csv
json
avro
I'm sure one of the answers will be "why don't you test it", but we're hoping that before architecting a converter or re-writing our application, an engineer could share with us which (if any) of the above formats would be the most performant in terms of loading data from a flat file into BQ.
Note: all of the above files would be stored in Google Cloud Storage: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage.
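For what it's worth, the same load can be repeated per format by changing only the source format option; a rough sketch using the BigQuery LOAD DATA statement (table name and bucket paths here are made up) would be:

-- Hypothetical table and GCS paths; only the format option changes between runs.
LOAD DATA OVERWRITE mydataset.benchmark_table
FROM FILES (
  format = 'AVRO',  -- or 'CSV', 'NEWLINE_DELIMITED_JSON', 'PARQUET'
  uris = ['gs://my-bucket/export/*.avro']
);

The elapsed time of each load job can then be read from the job statistics to compare the formats.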
https://cloud.google.com/blog/big-data/2016/03/improve-bigquery-ingestion-times-10x-by-using-avro-source-format
"Improve BigQuery ingestion times 10x by using Avro source format"
The ingestion speed has, to this point, been dependent upon the file format that we export from BigQuery. In prior releases of the SDK, tables and queries were made available to Dataflow as JSON-encoded objects in Google Cloud Storage. Considering that every such entry has the same schema, this representation is extremely redundant, essentially duplicating the schema, in string form, for every record.
In the 1.5.0 release, Dataflow uses the Avro file format to binary-encode and decode BigQuery data according to a single shared schema. This reduces the size of each individual record to correspond to the actual field values.
Take care not to limit your comparison to just benchmarks. Those formats also imply some limitations for the client that writes data into BigQuery, and you should also consider them. For instance:
Size of the allowed compressed files (https://cloud.google.com/bigquery/quotas#load_jobs)
CSV is quite "fragile" as a serialization format (no control of types, for instance)
Avro offers poor support for types like Timestamp, Date, Time.
Related
My team is working on building a data platform on Google Cloud Platform.
We uploaded our company's data to Google Cloud Storage and are trying to build a data mart on BigQuery.
However, in order to keep GCP usage cost down, we are deciding whether to load all the data from GCS into BigQuery or to create external tables on BigQuery.
Which way is more cost efficient?
BigQuery's external table capability blurs the border between the data lake (files) and the data warehouse (structured data), so your question is relevant.
When you use an external table, several features are missing, such as clustering and partitioning, and your files are parsed on the fly (with type casting): processing is slower and you can't control or limit the volume of data that you process. On top of that, errors in a file can break your queries.
When you use a native table, the storage is optimized for BigQuery processing: the data is already cleaned and parsed, and the table can be partitioned and clustered.
The question of cost has several parts. First, data storage: if you have files in GCS and the same data in BigQuery, you pay for storage twice. However, after 90 days without any update, the data moves to long-term ("archive") storage in BigQuery and costs half as much. In addition, you can move your GCS files to a colder storage class once they have been loaded into BigQuery.
That's for storage. Then the processing. First of all, processing roughly costs 10 times more than storage, so it's the most important thing to focus on. When you run a BigQuery query, you pay for the volume of data that the query scans. With partitioned or clustered native BigQuery tables, you can limit the amount of data scanned and therefore greatly reduce the cost. With external tables, you can't use the partitioning and clustering features, so you always pay for the full amount of data.
Therefore, it depends (as always) on your volume of data and the frequency of the requests.
Don't forget something else: with external tables, a bad file can break your queries. In production, that can be serious, so plan for it.
Finally, querying an external table is slower than a native table (no partitioning, therefore more data to process, plus parsing/casting time). Time is money (especially if you have time-critical queries), and that intangible cost can also influence your choice.
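To illustrate the partitioning/clustering point, here is a minimal sketch (table and column names are hypothetical): the same date filter prunes partitions on a native table, but still scans every file behind an external table.

-- Materialize the external data as a native table, partitioned by day and clustered.
CREATE TABLE mydataset.events_native
PARTITION BY DATE(event_ts)
CLUSTER BY customer_id
AS SELECT * FROM mydataset.events_external;

-- Only the partitions matching the filter are scanned (and billed).
SELECT customer_id, COUNT(*) AS nb_events
FROM mydataset.events_native
WHERE DATE(event_ts) = '2019-10-31'
GROUP BY customer_id;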
@guillaume blaquiere's answer is okay, but it misses something important: it is possible to do partitioned queries. You can create hive-partitioned external tables linked to a bucket in Cloud Storage. E.g.:
gs://myBucket/myTable/dt=2019-10-31/lang=en/foo
gs://myBucket/myTable/dt=2018-10-31/lang=fr/bar
Then, you can use "dt" or "lang" filters in SQL queries from BigQuery.
https://cloud.google.com/bigquery/docs/hive-partitioned-queries-gcs
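A minimal sketch of such a hive-partitioned external table over that layout (the dataset name and file format are assumptions):

-- dt and lang become queryable partition columns inferred from the GCS paths.
CREATE EXTERNAL TABLE mydataset.myTable
WITH PARTITION COLUMNS
OPTIONS (
  format = 'CSV',
  uris = ['gs://myBucket/myTable/*'],
  hive_partition_uri_prefix = 'gs://myBucket/myTable'
);

-- The dt/lang filters limit which files BigQuery reads from the bucket.
SELECT * FROM mydataset.myTable
WHERE dt = '2019-10-31' AND lang = 'en';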
We are in Google Cloud Platform, so technologies there would be a good win. We have a huge file that comes in, and Dataflow scales on the input to break up the file quite nicely. After that, however, it streams through many systems: microservice1, over to data connectors grabbing related data, over to ML, and finally to a last microservice.
Since the final stage could be around 200-1000 servers depending on load, how can we gather all the requests coming in? (Yes, we have a file id attached to every request, including a customerRequestId in case a file is dropped multiple times.) We only need every line with the same customerRequestId written to the same output file.
What is the best method to do this? The resulting file is almost always a csv file.
Any ideas or good options I can explore? I wonder, since Dataflow was good at ingesting and reading a massively large file in parallel, whether it is also good at taking in various inputs on a cluster of nodes (not a single node, which would bottleneck us).
EDIT: I seem to recall that HDFS partitions files across nodes and that, somehow, those partitions can be written by many nodes at the same time (a node per partition). Does anyone know if Google Cloud Storage files work this way as well? Is there a way to have 200 nodes write to 200 partitions of the same file in Google Cloud Storage in such a way that it is still all one file?
EDIT 2:
I see that there is a streaming Pub/Sub to BigQuery option that could be done as one stage in this list: https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming
HOWEVER, in that list there is no batch BigQuery-to-CSV template (which is what our customer wants). I do see a BigQuery-to-Parquet option here, though: https://cloud.google.com/dataflow/docs/guides/templates/provided-batch
I would prefer to go directly to CSV, though. Is there a way?
thanks,
Dean
Your case is complex and hard (and expensive) to reproduce. My first idea is to use BigQuery: sink all the data into the same table with Dataflow.
Then create a temporary table containing only the data to export to CSV, like this:
CREATE TABLE `myproject.mydataset.mytemptable`
OPTIONS(
expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
) AS
SELECT ....
And then export the temporary table to CSV. If the table is less than 1 GB, only one CSV file will be generated.
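That export step could look roughly like this with the EXPORT DATA statement (the bucket path is hypothetical; the URI must contain a * wildcard, and a table under ~1 GB typically ends up in a single file):

-- Export the temporary table to CSV in Cloud Storage.
EXPORT DATA OPTIONS (
  uri = 'gs://my-bucket/exports/result-*.csv',
  format = 'CSV',
  overwrite = true,
  header = true
) AS
SELECT * FROM `myproject.mydataset.mytemptable`;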
If you need to orchestrate these steps, you can use Workflows.
I've got an external table in BigQuery that pulls its data from Avro files on Google Cloud Storage. I'm currently hive partitioning the data on date as every query will use the date, with an emphasis on newer data. I'm considering also partitioning further on organisation.
I'm not finding much information on best practices for partitioning to maintain performance and keep costs low. Should I be aiming to keep the number of file reads low (i.e. have a small number of larger files), or should I be looking to keep the number of bytes read by BigQuery low (more, smaller files with a fine-grained partition strategy)? Or perhaps it's more nuanced and there's a balance to be struck?
I know this is a tough question without understanding the dataset and queries but I just want to find somewhere to start from rather than just guessing and having to change it later.
There is no general prescription for achieving the best performance when querying data stored externally (federated data) behind BigQuery, as it mostly depends on the use case and the customer's purpose. Citing the GCP documentation, external data sources suit:
Loading and cleaning your data in one pass by querying the data from an external data source (a location external to BigQuery) and writing the cleaned result into BigQuery storage.
Having a small amount of frequently changing data that you join with other tables. As an external data source, the frequently changing data does not need to be reloaded every time it is updated.
As I mentioned in the comment, due to external data source limitations, if query performance is the leading factor, it is recommended to switch to the classic way of loading data into a BigQuery sink:
Query performance for external data sources may not be as high as querying data in a native BigQuery table. If query speed is a priority, load the data into BigQuery instead of setting up an external data source.
Having said this, there is no specific enhancement of the I/O operations with GCS when it is used with BigQuery external data sources:
In general, query performance for external data sources should be equivalent to reading the data directly from the external storage.
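As a concrete sketch of the "loading and cleaning your data in one pass" pattern quoted above (all table and column names are hypothetical), you can query the external table, cast and filter, and persist the result as a native table:

-- One pass: read the federated (external) table, clean it, write a native table.
CREATE OR REPLACE TABLE mydataset.orders_native
PARTITION BY order_date
AS
SELECT
  SAFE_CAST(order_id AS INT64) AS order_id,
  DATE(order_ts) AS order_date,
  organisation
FROM mydataset.orders_external
WHERE SAFE_CAST(order_id AS INT64) IS NOT NULL;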
Also, is there anything wrong with doing transforms/joins directly within BigQuery? I'd like to minimize the number of components and steps involved for a data warehouse I'm setting up (simple transaction and inventory data for a chain of retail stores.)
Well, if you go through GCS it means you are not streaming your data; loading from a file into BQ is free, and files can be up to 5 TB in size. That is sometimes an advantage: the large-file capability and being free. Also, streaming is real-time, while going through GCS means it's not real-time.
If you want to directly stream data into BQ tables that has a cost. Currently the price for streaming is $0.01 per 200 MB (June 2018), so around $50 for 1TB.
On the other hand, transformations can be done with SQL if you can express the task that way. Otherwise you have plenty of options; most of the time people use Dataflow to transform things. See the linked tutorial for an advanced example.
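For the SQL route, a transform can be as simple as a CREATE TABLE ... AS SELECT run on a schedule; a rough sketch with made-up staging tables for the transaction/inventory case mentioned in the question:

-- ELT done entirely inside BigQuery: join staging tables into a reporting table.
CREATE OR REPLACE TABLE warehouse.daily_store_sales AS
SELECT
  t.store_id,
  DATE(t.sold_at) AS sale_date,
  SUM(t.quantity * i.unit_price) AS revenue
FROM staging.transactions AS t
JOIN staging.inventory AS i
  ON t.sku = i.sku
GROUP BY t.store_id, sale_date;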
Look also into
Cloud Dataprep - Data Preparation and Data Cleansing and
Google Data Studio: Easily Build Custom Reports and Dashboards
Also an advanced example:
Performing ETL from a Relational Database into BigQuery
Loading data via Cloud Storage is the fastest (and the cheapest) way.
Loading directly can be done from an app (using streaming inserts, which adds some additional cost).
As for doing transformations: if what you plan/need to do can be done in BigQuery, you should do it in BigQuery :) - it is the best and fastest way of doing ETL.
But you should take into account the cost of running queries (if you are not paying Google for slots, it can be $5 per 1 TB scanned).
Another good option for complex ETL is Dataflow, but it can become expensive very quickly, in exchange for more flexibility.
There are a lot of files, ranging in size from 1 KB to 5 MB, on our servers. The total size of those files is about 7 TB. The processing algorithm is: read a file and make some decisions about it. The files may be in several formats: doc, txt, png, bmp, etc. Therefore I can't merge those files to get bigger files.
How can I effectively store and process those files? What technology fits this task well?
You can use various technologies to store and process these files. Below are some you can use.
1 Apache Kafka: you can create a different topic for each format and push your data into these topics.
Advantage:
Based on your load, you can easily increase your consumption speed.
2 Hadoop: you can store your data in HDFS and design MR jobs to process it.
3 You can use any document-oriented NoSQL database to store your data.
Note: all of the above solutions store your data in a distributed fashion, and you can run them on commodity machines.
Store your data in a cloud (AWS, Google, Azure) and use their APIs to get and process the data (if you also want your data to be shared with other applications).
Start by segregating the files into different directories based on type. You can even have partitions within the individual directories, for example /data/images/YYYY-MM-DD, /data/text/YYYY-MM-DD.
Use MultipleInputs with an appropriate InputFormat for each path.
Normalize the data into a generic format before sending it to the reducer, if needed.
There are several ways to ingest data for your needs:
Use Kafka to store data under different topics based on type (image, text) and then copy it to HDFS from Kafka.
Use Flume.
As you have a huge amount of data, roll up the data in HDFS on a weekly basis. You can use Oozie or Falcon to automate the weekly rollup process.
Use CombineFileInputFormat in your Spark or MR code.
Last but not least, map the data as tables using Hive to expose it to external clients.
Hadoop Archives (HAR) are the usual way to address this.
More details are available at: https://hadoop.apache.org/docs/r2.7.0/hadoop-archives/HadoopArchives.html
You also have the option to use SequenceFiles or HBase, as described in: https://blog.cloudera.com/blog/2009/02/the-small-files-problem/
But looking at your use case, HAR fits the bill.