For our near real-time analytics, data will be streamed into Pub/Sub, and an Apache Beam Dataflow pipeline will process it by first writing it into BigQuery, then performing the aggregate processing by reading back from BigQuery, and finally storing the aggregated results in HBase for OLAP cube computation.
Here is the sample ParDo function that is used to fetch a record from BigQuery:
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

// Inside the ParDo's @ProcessElement method (bigquery.query() throws InterruptedException):
String eventInsertedQuery = "SELECT COUNT(*) AS usercount FROM <tablename> WHERE <condition>";
BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
QueryJobConfiguration queryConfig =
    QueryJobConfiguration.newBuilder(eventInsertedQuery).build();
TableResult result = bigquery.query(queryConfig); // blocks until the query job completes
FieldValueList row = result.getValues().iterator().next();
LOG.info("rowCount {}", row.get("usercount").getStringValue());
bigquery.query is taking around 4 seconds. Any suggestions to improve it? Since this is near real-time analytics, this duration is not acceptable.
Frequent reads from BigQuery can add undesired latency to your app. Considering that BigQuery is a data warehouse for analytics, I would say that 4 seconds is a good response time. I would suggest optimizing the query to bring it under that 4-second threshold.
Here is a list of options you can consider:
Optimizing the query statement, including changing the table schema to add partitioning or clustering (see the sketch after this list).
Using a relational database provided by Cloud SQL to get better response times.
Changing the architecture of your app. As recommended in the comments, it is a good option to transform the data before writing it to BigQuery, so you can avoid the latency of querying the data twice. There are several articles on performing near real-time computation with Dataflow (e.g. building a real-time app and aggregating data in real time).
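To make the partitioning/clustering option concrete, here is a minimal sketch (not the original code; the dataset, table, and column names event_ts and user_id are assumptions) that uses the BigQuery Java client to create a day-partitioned table clustered by user id, so that a count query filtered on those columns scans only a fraction of the data:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Clustering;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;
import com.google.cloud.bigquery.TimePartitioning;
import java.util.Collections;

public class CreatePartitionedTable {
  public static void main(String[] args) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Hypothetical schema: an event timestamp plus the column the count filters on.
    Schema schema = Schema.of(
        Field.of("event_ts", StandardSQLTypeName.TIMESTAMP),
        Field.of("user_id", StandardSQLTypeName.STRING));

    // Partition by day on event_ts and cluster by user_id so that queries
    // filtering on those columns only scan the matching partitions/blocks.
    StandardTableDefinition definition = StandardTableDefinition.newBuilder()
        .setSchema(schema)
        .setTimePartitioning(
            TimePartitioning.newBuilder(TimePartitioning.Type.DAY).setField("event_ts").build())
        .setClustering(Clustering.newBuilder()
            .setFields(Collections.singletonList("user_id")).build())
        .build();

    bigquery.create(TableInfo.of(TableId.of("my_dataset", "events"), definition));
  }
}

Existing data would still need to be loaded (or copied) into such a table for the pruning to take effect.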
On the other hand, keep in mind that query completion time is not covered by the BigQuery SLA; in fact, it is expected that errors can occur and consume even more time to finish a query, see the Back-off Requirements section in the same link.
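As a minimal sketch of that back-off idea (the retry count and wait times below are arbitrary choices, not values from the BigQuery documentation), you can wrap the query call and retry with exponentially growing waits:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class QueryWithBackoff {
  static TableResult queryWithBackoff(BigQuery bigquery, QueryJobConfiguration config)
      throws InterruptedException {
    long waitMillis = 500;
    for (int attempt = 1; ; attempt++) {
      try {
        return bigquery.query(config);
      } catch (BigQueryException e) {
        if (attempt >= 5) {
          throw e; // give up after a handful of attempts
        }
        Thread.sleep(waitMillis);
        waitMillis *= 2; // exponential back-off between retries
      }
    }
  }
}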
Related
I've read almost all the threads about how to improve BigQuery performance in order to retrieve data in milliseconds, or at least in under a second.
I decided to use BI Engine for this purpose because it offers seamless integration without code changes and supports partitioning, smart offloading, real-time data, built-in compression, low latency, etc.
Unfortunately, for the same query I got a slower response time with BI Engine enabled than with just the query cache enabled.
BigQuery with cache hit
Average 691ms response time from BigQuery API
https://gist.github.com/bgizdov/b96c6c3d795f5f14e5e9a3e9d7091d85
BigQuery + BI Engine
Average 1605ms response time from BigQuery API.
finalExecutionDurationMs is about 200-300ms, but the total time to retrieve the data (just 8 rows) is 5-6 times longer.
BigQuery UI: Elapsed 766ms, but the actual time for its call to the REST entity service is 1.50s. This explains why I get similar results.
https://gist.github.com/bgizdov/fcabcbce9f96cf7dc618298b2d69575d
I am using Quarkus with the BigQuery integration and measuring the query time with Guava's Stopwatch.
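The measurement is roughly like this (a simplified sketch; the real code runs inside Quarkus, and the query and names here are placeholders):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;
import com.google.common.base.Stopwatch;
import java.util.concurrent.TimeUnit;

public class QueryTiming {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    QueryJobConfiguration config = QueryJobConfiguration.newBuilder(
            "SELECT category, SUM(amount) AS total FROM `my_dataset.facts` GROUP BY category")
        .build();

    Stopwatch stopwatch = Stopwatch.createStarted();
    TableResult result = bigquery.query(config); // end-to-end API call, not just execution time
    long elapsedMs = stopwatch.elapsed(TimeUnit.MILLISECONDS);

    System.out.printf("rows=%d elapsedMs=%d%n", result.getTotalRows(), elapsedMs);
  }
}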
The table is about 350MB; the BI Engine reservation is 1GB.
The query returns 8 rows, aggregated from 300 rows. This is a very small data size with a simple query.
I know BigQuery is not at its best with small data sizes (or that the size barely matters here), but I want to get the data in under a second; that's why I tried BI Engine, and this will not improve with bigger datasets.
Could you please share the job id?
BI Engine enables a number of optimizations, and for the vast majority of queries they allow significantly faster and more efficient processing.
However, there are corner cases where BI Engine optimizations are not as effective. One issue is the initial loading of the data: we fetch data into RAM using an optimal encoding, whereas BigQuery processes data directly, so subsequent queries should be faster. Another is that some operators are very easy to optimize to maximize CPU utilization (e.g. aggregations/filtering/compute), while others may be trickier.
In my project, I'm using Google BigQuery, which holds lots of data.
The BigQuery columns are:
account_id, session_id, transaction_id, username, event, timestamp.
In my dashboard, I'm fetching the entire data set based on timestamp (the last 30 days).
Since I have very large data, the performance is pretty slow (13 seconds to fetch the last 30 days of data).
Lately, I have been looking at Google Bigtable, and I saw it has an option to get data based on time.
In my tests, the performance of Bigtable was slower than BigQuery.
Is there a suggested schema that can improve the performance with Bigtable?
This is an example of my schema in Bigtable:
const row = {
  key: `transactions#${timestamp_micros}`,
  data: {
    identifiers: {
      session_id: `session_id-${startCounter}`,
      account_id: `acount-${startCounter}`,
      device_id: `device-${startCounter}`,
      transaction_id: `transaction_id-${startCounter}`,
      runtime_id: 'AQW+2Xx5AQAAstvxskK0c8NTk+vP5eBM',
      page_id: `page_id-${startCounter}`,
      start_time: timestamp,
    },
  },
};
Can anyone suggest a better schema that will help me fetch the data (based on a timestamp range) with the best performance?
A good schema results in excellent performance and scalability, while a bad schema can lead to a poorly performing system. However, no single schema design fits all use cases, so this question is somewhat opinion-based and the answer will vary from person to person. The patterns described on that page provide a starting point for deciding on a Bigtable schema. Your unique dataset and the queries you plan to run are the most important things to consider as you design a schema for your time-series data.
As you've discovered from our docs, the row key format is the biggest decision you make when using Bigtable, as it determines which access patterns can be performed efficiently. Having a row key of transaction_id#reverse_timestamp gets your data sorted from the latest timestamp. This can avoid hotspotting issues, which are one of the big reasons for slow query results.
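For illustration, here is a rough sketch (not your code; the project, instance, table, and sample values are assumptions) of writing the reverse timestamp into the row key and then reading the newest rows for one transaction with a prefix scan, using the Bigtable Java client:

import com.google.cloud.bigtable.data.v2.BigtableDataClient;
import com.google.cloud.bigtable.data.v2.models.Query;
import com.google.cloud.bigtable.data.v2.models.Row;
import com.google.cloud.bigtable.data.v2.models.RowMutation;

public class ReverseTimestampKeys {
  public static void main(String[] args) throws Exception {
    try (BigtableDataClient client = BigtableDataClient.create("my-project", "my-instance")) {
      long timestampMicros = System.currentTimeMillis() * 1000L;
      // Reverse the timestamp so the most recent rows sort first under the prefix.
      long reverseTs = Long.MAX_VALUE - timestampMicros;
      String rowKey = String.format("transaction_id-1#%019d", reverseTs);

      // Write one row keyed by transaction_id#reverse_timestamp.
      client.mutateRow(RowMutation.create("transactions", rowKey)
          .setCell("identifiers", "account_id", "account-1"));

      // Read the newest rows for that transaction with a simple prefix scan.
      Query query = Query.create("transactions").prefix("transaction_id-1#").limit(10);
      for (Row row : client.readRows(query)) {
        System.out.println(row.getKey().toStringUtf8());
      }
    }
  }
}

Note that a key starting with the timestamp alone would concentrate writes on a single node, so keeping a leading dimension like the transaction id (or a salt) in front helps avoid that hotspotting.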
However, you're also coming from a SQL architecture, which isn't always a good fit for Bigtable's schema/query model. So here are some questions to get you started:
- Are you planning to perform lots of ad hoc queries like "SELECT A FROM Bigtable WHERE B=x"? If so, strongly prefer BigQuery. Bigtable can't support this query without performing a full table scan (hence it is slower than BigQuery).
- Will you require multi-row OLTP transactions? Again, use BigQuery, as Bigtable only supports transactions within a single row.
- Are you streaming in new events at high QPS? Bigtable is much better for these sorts of high-volume updates.
- Do you want to perform any sort of large-scale complex transformations on the data? Again, Bigtable is likely better here, as you can stream data out and back in faster.
You can also combine the two services if you need some combination of these features. For example, say you're receiving high-volume updates all the time but want to be able to perform complex ad hoc queries. If you're alright working with a slightly delayed version of the data, it could make sense to write the updates to Bigtable, then periodically scan the table using Dataflow and export a post-processed version of the latest events into BigQuery. GCP also allows BigQuery to serve queries directly from Bigtable in some regions: https://cloud.google.com/bigquery/external-data-bigtable
My personal choice for your use case is BigQuery. You can leverage partition pruning in BigQuery, where BigQuery scans only the partitions that match the filter and skips the remaining ones. Not only does this make it easier to manage and query your data; by dividing a large table into smaller partitions you can improve query performance, and you can control costs by reducing the number of bytes read by a query. You can use time-unit column partitioning or ingestion-time partitioning. When you create a table partitioned by ingestion time, BigQuery automatically assigns rows to partitions based on the time when BigQuery ingests the data. You can choose hourly, daily, monthly, or yearly granularity for the partitions.
So your query for fetching the entire data set based on timestamp (the last 30 days) would look something like this in BigQuery (with ingestion-time partitioning):
SELECT
  column
FROM
  dataset.table
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01') AND TIMESTAMP('2016-01-30')
We are building a web application to give customers insight into their activity, based on events currently streaming into Elasticsearch. A customer is an organisation sending messages to people.
A concern has been raised that the requirement to host this data for three years implies a very large amount of storage and a high cost of implementation with Elasticsearch.
An alternative is to process each day's data into a report CSV stored in S3 and use something like Amazon Athena to perform the queries. Is Athena something that our application can send ad-hoc queries to in response to a web browser request? It is unlikely to generate a large volume of requests all the time, but I'm uncertain what the latency would be like.
Yes, Athena would be a possible solution to this use case – and done right it could also be fairly cheap.
Athena is not a low latency query engine, but for reporting purposes it's usually good enough. There's no way to say for sure without knowing more, but done right we're talking low single digit seconds.
You can approach this in different ways. One is to do as you say: generate a CSV every day, store these for as long as you need, and run queries against them as needed. From your description it sounds like these CSVs would already be aggregates, and I assume they would be significantly less than a megabyte per customer per day. If you partition them by customer and month, you should be able to run queries for arbitrary time periods in seconds.
Another approach would be to store all of your data on S3 and run queries against the full data set. As you stream data into Elasticsearch, stream it to S3 too. Depending on how you do that, you will probably need some ETL in the form of Lambda functions that partition the data per customer and time (day or month depending on the volume). You can then run Athena queries over the full historical data set. The downside would be slower queries (double-digit seconds for most queries, but I don't know your data volumes), but the upside would be full flexibility in what you can query.
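As a rough sketch (assuming the AWS SDK for Java v2 and hypothetical database, table, partition column, and bucket names), running one of those queries against the partitioned data from your application would look something like this; the application then polls for completion and fetches the result rows:

import software.amazon.awssdk.services.athena.AthenaClient;
import software.amazon.awssdk.services.athena.model.QueryExecutionContext;
import software.amazon.awssdk.services.athena.model.ResultConfiguration;
import software.amazon.awssdk.services.athena.model.StartQueryExecutionRequest;
import software.amazon.awssdk.services.athena.model.StartQueryExecutionResponse;

public class ReportQuery {
  public static void main(String[] args) {
    try (AthenaClient athena = AthenaClient.create()) {
      // Filtering on the customer/month partition columns lets Athena read only the relevant files.
      String sql = "SELECT event_date, SUM(messages) AS messages "
          + "FROM daily_reports "
          + "WHERE customer = 'acme' AND month BETWEEN '2019-01' AND '2019-03' "
          + "GROUP BY event_date ORDER BY event_date";

      StartQueryExecutionResponse started = athena.startQueryExecution(
          StartQueryExecutionRequest.builder()
              .queryString(sql)
              .queryExecutionContext(QueryExecutionContext.builder().database("reports").build())
              .resultConfiguration(ResultConfiguration.builder()
                  .outputLocation("s3://my-athena-results/").build())
              .build());

      // Poll GetQueryExecution until the state is SUCCEEDED, then page through
      // GetQueryResults to return the rows to the browser.
      System.out.println(started.queryExecutionId());
    }
  }
}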
With more details about the particulars of the use case I could help you with the details.
Athena is serverless. You can quickly query your data without having to set up and manage any servers or data warehouses. Just point to your data in Amazon S3, define the schema, and start querying using the built-in query editor.
Amazon Athena automatically executes queries in parallel, so most results come back within seconds/mins.
We're using Cloud Functions to transform our data in BigQuery:
- all the data is in BigQuery
- to transform the data, we only use SQL queries in BigQuery (see the sketch below)
- each query runs once a day
- our biggest SQL query runs for about 2 to 3 minutes, but most queries run for less than 30 seconds
- we have about 50 queries executed once a day, and this number is increasing
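For context, each function is essentially a thin wrapper around one query, something like this (a simplified sketch; the query, dataset, and table names are placeholders):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.functions.HttpFunction;
import com.google.cloud.functions.HttpRequest;
import com.google.cloud.functions.HttpResponse;

public class DailyTransform implements HttpFunction {
  @Override
  public void service(HttpRequest request, HttpResponse response) throws Exception {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    QueryJobConfiguration config = QueryJobConfiguration.newBuilder(
            "SELECT user_id, COUNT(*) AS events FROM `my_dataset.raw_events` GROUP BY user_id")
        .setDestinationTable(TableId.of("my_dataset", "daily_user_events"))
        .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
        .build();
    bigquery.query(config); // the transformation runs entirely inside BigQuery; the function just triggers it
    response.getWriter().write("done");
  }
}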
We tried at first to do the same thing (SQL queries in BigQuery) with Dataflow, but:
- it took about 10 to 15 minutes just to start Dataflow
- it is more complicated to code than our Cloud Functions
- at that time, Dataflow SQL was not implemented
Every time we talk with someone using GCP (users, trainers or auditors), they recommend using Dataflow.
So did we miss something "magic" about Dataflow in our use case? Is there a way to make it start in seconds rather than minutes?
Also, if we use streaming in Dataflow, how are costs calculated? I understand that in batch we pay for what we use, but what about streaming? Is it billed as a full-time running service?
Thanks for your help
For the first part, BigQuery vs. Dataflow, I discussed this with Google a few weeks ago and their advice is clear:
When you can express your transformation in SQL, and you can reach your data with BigQuery (external tables), it's always quicker and cheaper with BigQuery, even if the query is complex.
For all the other use cases, Dataflow is the most recommended:
- for real time (with a true need for real time, with metrics computed on the fly with windowing)
- when you need to reach an external API (ML, external services, ...)
- when you need to sink into something other than BigQuery (Firestore, Bigtable, Cloud SQL, ...) or read from a source not reachable by BigQuery
And yes, Dataflow takes about 3 minutes to start and another 3 minutes to stop. It's long... and you pay for this unproductive time.
For batch, as for streaming, you simply pay for the number (and the size) of the Compute Engine instances used by your pipeline. Dataflow scales automatically within the boundaries that you provide. A streaming pipeline doesn't scale to 0: if you have no messages in your Pub/Sub subscription, you still have at least 1 VM up and you pay for it.
I have a raw data table in BigQuery that has hundreds of millions of rows. I run a scheduled query every 24 hours to produce some aggregations, which results in a table in the ballpark of 33 million rows (6GB) that may be expected to grow slowly to approximately double its current size.
I need a way to get quick, one-row-at-a-time lookup access by id to that aggregate table in a separate event-driven pipeline. I.e. a process is notified that person A just took an action; what do we know about this person's history from the aggregation table?
Clearly BigQuery is the right tool to produce the aggregate table, but not the right tool for the quick lookups. So I need to offload it to a secondary datastore like Firestore. But what is the best process to do so?
I can envision a couple of strategies:
1) Schedule a dump of the agg table to GCS. Kick off a Dataflow job to stream the contents of the GCS dump to Pub/Sub. Create a serverless function to listen to the Pub/Sub topic and insert rows into Firestore.
2) A long-running script on Compute Engine which just streams the table directly from BQ and runs inserts. (Seems slower than strategy 1.)
3) Schedule a dump of the agg table to GCS. Format it in such a way that it can be directly imported into Firestore via gcloud beta firestore import gs://[BUCKET_NAME]/[EXPORT_PREFIX]/
4) Maybe some kind of Dataflow job that performs lookups directly against the BigQuery table? I've not played with this approach before. No idea how costly/performant it is.
5) Some other option I've not considered?
The ideal solution would give me access in milliseconds to an agg row, which would allow me to append data to the real-time event.
Is there a clear winner here in the strategy I should pursue?
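For reference, the kind of serving-side read I'm after is a simple keyed lookup, something like this with the Firestore Java client (the collection and document ids are placeholders):

import com.google.cloud.firestore.DocumentSnapshot;
import com.google.cloud.firestore.Firestore;
import com.google.cloud.firestore.FirestoreOptions;

public class AggLookup {
  public static void main(String[] args) throws Exception {
    Firestore db = FirestoreOptions.getDefaultInstance().getService();
    // One document per person, keyed by id, written by whichever export strategy wins.
    DocumentSnapshot snapshot =
        db.collection("person_aggregates").document("person-123").get().get();
    if (snapshot.exists()) {
      System.out.println(snapshot.getData()); // the pre-aggregated history for this person
    }
  }
}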
Remember that you could also CLUSTER your table by id, making your lookup queries much faster and less data-consuming. They will still take more than a second to run, though.
https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
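For example, with the aggregate table clustered by the person id, the lookup would be a parameterized query like this (a sketch; the table and column names are assumptions), and BigQuery would only read the blocks containing that id:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.QueryParameterValue;

public class ClusteredLookup {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    // Clustering by person_id lets BigQuery prune most blocks, so the query
    // reads far less data -- but it still runs in the order of seconds.
    QueryJobConfiguration config = QueryJobConfiguration.newBuilder(
            "SELECT * FROM `my_dataset.person_aggregates` WHERE person_id = @id")
        .addNamedParameter("id", QueryParameterValue.string("person-123"))
        .build();
    for (FieldValueList row : bigquery.query(config).iterateAll()) {
      System.out.println(row);
    }
  }
}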
You could also set up exports from BigQuery to Cloud SQL for sub-second results:
https://medium.com/@gabidavila/how-to-serve-bigquery-results-from-mysql-with-cloud-sql-b7ddacc99299
And remember, BigQuery can now read straight out of Cloud SQL if you'd like it to be your source of truth for "hot" data:
https://medium.com/google-cloud/loading-mysql-backup-files-into-bigquery-straight-from-cloud-sql-d40a98281229