How best cache bigquery table for fast lookup of individual row? - google-cloud-platform

I have a raw data table in bigquery that has hundreds of millions of rows. I run a scheduled query every 24 hours to produce some aggregations that results a table in the ballmark of 33 million rows (6gb) but may be expected to grow slowly to approximately double its current size.
I need a way to get 1 row at a time quick access lookup by id to that aggregate table in a separate event driven pipeline. i.e. A process is notified that person A just took an action, what do we know about this person's history from the aggregation table?
Clearly bigquery is the right tool to produce the aggregate table, but not the right tool for the quick lookups. So I need to offset it to a secondary datastore like firestore. But what is the best process to do so?
I can envision a couple strategies:
1) Schedule a dump of agg table to GCS. Kick off a dataflow job to stream contents of gcs dump to pubsub. Create a serverless function to listen to pubsub topic and insert rows into firestore.
2) A long running script on compute engine which just streams the table directly from BQ and runs inserts. (Seems slower than strategy 1)
3) Schedule a dump of agg table to GCS. Format it in such a way that can be directly imported to firestore via gcloud beta firestore import gs://[BUCKET_NAME]/[EXPORT_PREFIX]/
4) Maybe some kind of dataflow job that performs lookups directly against the bigquery table? Not played with this approach before. No idea how costly / performant.
5) some other option I've not considered?
The ideal solution would allow me quick access in milliseconds to an agg row which would allow me to append data to the real time event.
Is there a clear best winner here in the strategy I should persue?

Remember that you could also CLUSTER your table by id - making your lookup queries way faster and less data consuming. They will still take more than a second to run though.
https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
You could also set up exports from BigQuery to CloudSQL, for subsecond results:
https://medium.com/#gabidavila/how-to-serve-bigquery-results-from-mysql-with-cloud-sql-b7ddacc99299
And remember, now BigQuery can read straight out of CloudSQL if you'd like it to be your source of truth for "hot-data":
https://medium.com/google-cloud/loading-mysql-backup-files-into-bigquery-straight-from-cloud-sql-d40a98281229

Related

Billions of rows table as input to Quicksight dataset

there are two redshift table named A & B and a Quicksight dashboard where it takes A MINUS B as query to display content for a visual. If we use DIRECT query option and it is getting timedout because query is not completing in 2 mins(Quicksight have hard limit to run query within 2 mins) . Is there a way to use such large datasets as input Quicksight dasboard visual ?
Can't use SPICE engine because it have limit 1B or 1TB size limit.Also, it have 15 mins of delay to refresh data.
You will likely need to provide more information to fully resolve. MINUS can be a very expensive operation especially if you haven't optimized the tables for this operation. Can you provide information about your table setup and the EXPLAIN plan of the query you are running?
Barring improving the query, one way to work around a poorly performing query behind quicksight is to move this query to a materialized view. This way the result of the query can be stored for later retrieval but needs to be refreshed when the source data changes. It sounds like your data only changes every 15 min (did I get this right?) then this may be an option.

Schema design for Google BigTable

In my project, Im using Google BigQuery that holds loots of data.
The BigQuery columns are:
account_id, session_id, transaction_id, username, event, timestamp.
In my dashboard, Im fetching the entire data based on time stamp (last 30 days).
Since I have very large data, the performance are pretty slow (13 sec to fetch the last 30 days data).
Lately, I try to look on Google BigTable and I saw they have an option to get data based on time.
In my tests, the performance of the BigTable are slower from the BigQuery.
Is any suggested schema that can improve the performance with BigTable?
This is example to my schema in BigTable:
const row = {
key: `transactions#${timestamp_micros}`,
data: {
identifiers: {
session_id: `session_id-${startCounter}`,
account_id: `acount-${startCounter}`,
device_id: `device-${startCounter}`,
transaction_id: `transaction_id-${startCounter}`,
runtime_id: 'AQW+2Xx5AQAAstvxskK0c8NTk+vP5eBM',
page_id: `page_id-${startCounter}`,
start_time: timestamp,
},
},
};
Is anyone can suggest a better schema that will help me to fetch the data (based on timestamp range) with the best performance?
A good schema results in excellent performance and scalability, and a bad schema can lead to a poorly performing system. However, no single schema design provides the best fit for all use cases and hence your question is opinionated and will vary from person to person. The patterns described on this page provide a starting point to decide a schema for BigTable. Your unique dataset and the queries you plan to use are the most important things to consider as you design a schema for your time-series data.
As you've discovered from our docs, the row key format is the biggest decision you make when using Bigtable, as it determines which access patterns can be performed efficiently. Having row key transaction_id#reverse_timestampgets your data sorted from the latest timestamp. This could avoid hotspotting issues, which is one of the big reasons for slow query results.
However, you're also coming from a SQL architecture, which isn't always a good fit for Bigtable's schema/query model. So here are some questions to get you started:
Are you planning to perform lots of ad hoc queries like "SELECT A
FROM Bigtable WHERE B=x"? If so, strongly prefer BigQuery. Bigtable
can't support this query without performing a full table scan. (hence
it is slower than BigQuery)
Will you require multi-row OLTP transactions? Again, use BigQuery, as
Bigtable only supports transactions within a single row.
Are you streaming in new events at high QPS? Bigtable is much better
for these sorts of high-volume updates. Do you want to perform any
sort of large-scale complex transformations on the data? Again,
Bigtable is likely better here, as you can stream data out and back
in faster.
You can also combine the two services if you need some combination of these features. For example, say you're receiving high-volume updates all the time, but want to be able to perform complex ad hoc queries. If you're alright working with a slightly delayed version of the data, it could make sense to write the updates to Bigtable, then periodically scan the table using Dataflow and export a post-processed version of the latest events into BigQuery. GCP also allows BigQuery to serve queries directly from Bigtable in a some regions: https://cloud.google.com/bigquery/external-data-bigtable
My personal choice for your use case is Big Query. You can leverage the pruning in Big Query where BigQuery scans the partitions that match the filter and skip the remaining partitions. Not only does it make it easier to manage and query your data. By dividing a large table into smaller partitions, you can improve query performance, and you can control costs by reducing the number of bytes read by a query. You can use time-unit column partitioning or ingestion time partitioning. When you create a table partitioned by ingestion time, BigQuery automatically assigns rows to partitions based on the time when BigQuery ingests the data. You can choose hourly, daily, monthly, or yearly granularity for the partitions.
So your query for fetching the entire data based on timestamp (last 30 days) should be something like this in BigQuery (when used partitioning):
SELECT
column
FROM
dataset.table
WHERE
_PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01') AND TIMESTAMP('2016-01-02')

Optimal Big Data solution for aggregating time-series data and storing results to DynamoDB

I am looking into different Big Data solutions and have not been able to find a clear answer or documentation on what might be the best approach and frameworks/services to use to address my Big Data use-case.
My Use-case:
I have a data producer that will be sending ~1-2 billion events to a
Kinesis Data Firehose delivery stream daily.
This data needs to be stored in some data lake / data warehouse, aggregated, and then
loaded into DynamoDB for our service to consume the aggregated data
in its business logic.
The DynamoDB table needs to be updated hourly. (hourly is not a hard requirement but we would like DynamoDB to be updated as soon as possible, at the longest intervals of daily updates if required)
The event schema is similar to: customerId, deviceId, countryCode, timestamp
The aggregated schema is similar to: customerId, deviceId, countryCode (the aggregation is on the customerId's/deviceId's MAX(countryCode) for each day over the last 29 days, and then the MAX(countryCode) overall over the last 29 days.
Only the CustomerIds/deviceIds that had their countryCode change from the last aggregation (from an hour ago) should be written to DynamoDB to keep required write capacity units low.
The raw data stored in the data lake / data warehouse needs to be deleted after 30 days.
My proposed solution:
Kinesis Data Firehose delivers the data to a Redshift staging table (by default using S3 as intermediate storage and then using the COPY command to load to Redshift)
An hourly Glue job that:
Drops the 30 day old time-series table and creates a new time-series table for today in Redshift if this is the first job run of a new day
Loads data from staging table to the appropriate time-series table
Creates a view on top of the last 29 days of time-series tables
Aggregates by customerId, deviceId, date, and MAX(CountryCode)
Then aggregates by customerId, deviceId, MAX(countryCode)
Writes the aggregated results to an S3 bucket
Checks the previous hourly Glue job's run aggregated results vs. the current runs aggregated results to find the customerIds/deviceIds that had their countryCode change
Writes the customerIds/deviceIds rows that had their countryCode change to DynamoDB
My questions:
Is Redshift the best storage choice here? I was also considering using S3 as storage and directly querying data from S3 using a Glue job, though I like the idea of a fully-managed data warehouse.
Since our data has a fixed retention period of 30 days, AWS documentation: https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-time-series-tables.html suggests to use time-series tables and running DROP TABLE on older data that needs to be deleted. Are there other approaches (outside of Redshift) that would make the data lifecycle management easier? Having the staging table, creating and loading into new time-series tables, dropping older time-series tables, updating the view to include the new time-series table and not the one that was dropped could be error prone.
What would be an optimal way to find the the rows (customerId/deviceId combinations) that had their countryCode change since the last aggregation? I was thinking the Glue job could create a table from the previous runs aggregated results S3 file and another table from the current runs aggregated results S3 file, run some variation of a FULL OUTER JOIN to find the rows that have different countryCodes. Is there a better approach here that I'm not aware of?
I am a newbie when it comes to Big Data and Big Data solutions so any and all input is appreciated!
tldr: Use step functions, not Glue. Use Redshift Spectrum with data in S3. Otherwise you overall structure looks on track.
You are on the right track IMHO but there are a few things that could be better. Redshift is great for sifting through tons of data and performing analytics on it. However I'm not sure you want to COPY the data into Redshift if all you are doing is building aggregates to be loaded into DDB. Do you have other analytic workloads being done that will justify storing the data in Redshift? Are there heavy transforms being done between the staging table and the time series event tables? If not you may want to make the time series tables external - read directly from S3 using Redshift Spectrum. This could be a big win as the initial data grouping and aggregating is done in the Spectrum layer in S3. This way the raw data doesn't have to be moved.
Next I would advise not using Glue unless you have a need (transform) that cannot easily be done elsewhere. I find Glue to require some expertise to get to do what you want and it sounds like you would just be using it for a data movement orchestrator. If this impression is correct you will be better off with a step function or even a data pipeline. (I've wasted way too much time trying to get Glue to do simple things. It's a powerful tool but make sure you'll get value from the time you will spend on it.)
If you are only using Redshift to do these aggregations and you go the Spectrum route above you will want to get as small a cluster as you can get away with. Redshift can be pricy and if you don't use its power, not cost effective. In this case you can run the cluster only as needed but Redshift boot up times are not fast and the smallest clusters are not expensive. So this is a possibility but only in the right circumstances. Depending on how difficult the aggregation is that you are doing you might want to look at Athena. If you are just running a few aggregating queries per hour then this could be the most cost effective approach.
Checking against the last hour's aggregations is just a matter of comparing the new aggregates against the old which are in S3. This is easily done with Redshift Spectrum or Athena as they can makes files (or sets of files) the source for a table. Then it is just running the queries.
In my opinion Glue is an ETL tool that can do high power transforms. It can do a lot of things but is not my first (or second) choice. It is touchy, requires a lot of configuration to do more than the basics, and requires expertise that many data groups don't have. If you are a Glue expert, knock you self out; If not, I would avoid.
As for data management, yes you don't want to be deleting tons of rows from the beginning of tables in Redshift. It creates a lot of data reorganization work. So storing your data in "month" tables and using a view is the right way to go in Redshift. Dropping tables doesn't create this housekeeping. That said if you organize you data in S3 in "month" folders then unneeded removing months of data can just be deleting these folders.
As for finding changing country codes this should be easy to do in SQL. Since you are comparing aggregate data to aggregate data this shouldn't be expensive either. Again Redshift Spectrum or Athena are tools that allow you to do this on S3 data.
As for being a big data newbie, not a worry, we all started there. The biggest difference from other areas is how important it is to move the data the fewest number of times. It sounds like you understand this when you say "Is Redshift the best storage choice here?". You seem to be recognizing the importance of where the data resides wrt the compute elements which is on target. If you need the horsepower of Redshift and will be accessing the data over and over again then the Redshift is the best option - The data is moved once to a place where the analytics need to run. However, Redshift is an expensive storage solution - it's not what it is meant to do. Redshift Spectrum is very interesting in that the initial aggregations of data is done in S3 and much reduced partial results are sent to Redshift for completion. S3 is a much cheaper storage solution and if your workload can be pattern-matched to Spectrum's capabilities this can be a clear winner.
I want to be clear that you have only described on area where you need a solution and I'm assuming that you don't have other needs for a Redshift cluster operating on the same data. This would change the optimization point.

AWS Athena Query Partitioning

I am trying to use AWS Athena to provide analytics for an existing platform. Currently the flow looks like this:
Data is pumped into a Kinesis Firehose as JSON events.
The Firehose converts the data to parquet using a table in AWS Glue and writes to S3 either every 15 mins or when the stream reaches 128 MB (max supported values).
When the data is written to S3 it is partitioned with a path /year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/...
An AWS Glue crawler update a table with the latest partition data every 24 hours and makes it available for queries.
The basic flow works. However, there are a couple of problems with this...
The first (and most important) is that this data is part of a multi-tenancy application. There is a property inside each event called account_id. Every query that will ever be issued will be issued by a specific account and I don't want to be scanning all account data for every query. I need to find a scalable way query only the relevant data. I did look into trying to us Kinesis to extract the account_id and use it as a partition. However, this currently isn't supported and with > 10,000 accounts the AWS 20k partition limit quickly becomes a problem.
The second problem is file size! AWS recommend that files not be < 128 MB as this has a detrimental effect on query times as the execution engine might be spending additional time with the overhead of opening Amazon S3 files. Given the nature of the Firehose I can only ever reach a maximum size of 128 MB per file.
With that many accounts you probably don't want to use account_id as partition key for many reasons. I think you're fine limits-wise, the partition limit per table is 1M, but that doesn't mean it's a good idea.
You can decrease the amount of data scanned significantly by partitioning on parts of the account ID, though. If your account IDs are uniformly distributed (like AWS account IDs) you can partition on a prefix. If your account IDs are numeric partitioning on the first digit would decrease the amount of data each query would scan by 90%, and with two digits 99% – while still keeping the number of partitions at very reasonable levels.
Unfortunately I don't know either how to do that with Glue. I've found Glue very unhelpful in general when it comes to doing ETL. Even simple things are hard in my experience. I've had much more success using Athena's CTAS feature combined with some simple S3 operation for adding the data produced by a CTAS operation as a partition in an existing table.
If you figure out a way to extract the account ID you can also experiment with separate tables per account, you can have 100K tables in a database. It wouldn't be very different from partitions in a table, but could be faster depending on how Athena determines which partitions to query.
Don't worry too much about the 128 MB file size rule of thumb. It's absolutely true that having lots of small files is worse than having few large files – but it's also true that scanning through a lot of data to filter out just a tiny portion is very bad for performance, and cost. Athena can deliver results in a second even for queries over hundreds of files that are just a few KB in size. I would worry about making sure Athena was reading the right data first, and about ideal file sizes later.
If you tell me more about the amount of data per account and expected life time of accounts I can give more detailed suggestions on what to aim for.
Update: Given that Firehose doesn't let you change the directory structure of the input data, and that Glue is generally pretty bad, and the additional context you provided in a comment, I would do something like this:
Create an Athena table with columns for all properties in the data, and date as partition key. This is your input table, only ETL queries will be run against this table. Don't worry that the input data has separate directories for year, month, and date, you only need one partition key. It just complicates things to have these as separate partition keys, and having one means that it can be of type DATE, instead of three separate STRING columns that you have to assemble into a date every time you want to do a date calculation.
Create another Athena table with the same columns, but partitioned by account_id_prefix and either date or month. This will be the table you run queries against. account_id_prefix will be one or two characters from your account ID – you'll have to test what works best. You'll also have to decide whether to partition on date or a longer time span. Dates will make ETL easier and cheaper, but longer time spans will produce fewer and larger files, which can make queries more efficient (but possibly more expensive).
Create a Step Functions state machine that does the following (in Lambda functions):
Add new partitions to the input table. If you schedule your state machine to run once per day it can just add the partition that correspond to the current date. Use the Glue CreatePartition API call to create the partition (unfortunately this needs a lot of information to work, you can run a GetTable call to get it, though. Use for example ["2019-04-29"] as Values and "s3://some-bucket/firehose/year=2019/month=04/day=29" as StorageDescriptor.Location. This is the equivalent of running ALTER TABLE some_table ADD PARTITION (date = '2019-04-29) LOCATION 's3://some-bucket/firehose/year=2019/month=04/day=29' – but doing it through Glue is faster than running queries in Athena and more suitable for Lambda.
Start a CTAS query over the input table with a filter on the current date, partitioned by the first character(s) or the account ID and the current date. Use a location for the CTAS output that is below your query table's location. Generate a random name for the table created by the CTAS operation, this table will be dropped in a later step. Use Parquet as the format.
Look at the Poll for Job Status example state machine for inspiration on how to wait for the CTAS operation to complete.
When the CTAS operation has completed list the partitions created in the temporary table created with Glue GetPartitions and create the same partitions in the query table with BatchCreatePartitions.
Finally delete all files that belong to the partitions of the query table you deleted and drop the temporary table created by the CTAS operation.
If you decide on a partitioning on something longer than date you can still use the process above, but you also need to delete partitions in the query table and the corresponding data on S3, because each update will replace existing data (e.g. with partitioning by month, which I would recommend you try, every day you would create new files for the whole month, which means that the old files need to be removed). If you want to update your query table multiple times per day it would be the same.
This looks like a lot, and looks like what Glue Crawlers and Glue ETL does – but in my experience they don't make it this easy.
In your case the data is partitioned using Hive style partitioning, which Glue Crawlers understand, but in many cases you don't get Hive style partitions but just Y/M/D (and I didn't actually know that Firehose could deliver data this way, I thought it only did Y/M/D). A Glue Crawler will also do a lot of extra work every time it runs because it can't know where data has been added, but you know that the only partition that has been added since yesterday is the one for yesterday, so crawling is reduced to a one-step-deal.
Glue ETL is also makes things very hard, and it's an expensive service compared to Lambda and Step Functions. All you want to do is to convert your raw data form JSON to Parquet and re-partition it. As far as I know it's not possible to do that with less code than an Athena CTAS query. Even if you could make the conversion operation with Glue ETL in less code, you'd still have to write a lot of code to replace partitions in your destination table – because that's something that Glue ETL and Spark simply doesn't support.
Athena CTAS wasn't really made to do ETL, and I think the method I've outlined above is much more complex than it should be, but I'm confident that it's less complex than trying to do the same thing (i.e. continuously update and potentially replace partitions in a table based on the data in another table without rebuilding the whole table every time).
What you get with this ETL process is that your ingestion doesn't have to worry about partitioning more than by time, but you still get tables that are optimised for querying.

Redshift performance: SQL queries vs table normalization

I'm working on building a redshift database by listening to events from from different sources and pump that data into a redshift cluster.
The idea is to use Kinesis firehose to pump data to redshift using COPY command. But I have a dilemma here: I wish to first query some information from redshift using a select query such as the one below:
select A, B, C from redshift__table where D='x' and E = 'y';
After getting the required information from redshift, I will combine that information with my event notification data and issue a request to kinesis. Kinesis will then do its job and issue the required COPY command.
Now my question is that is it a good idea to repeatedly query redshift like say every second since that is the expected time after which I will get event notifications?
Now let me describe an alternate scenario:
If I normalize my table and separate out some fields into a separate table then, I will have to perform fewer redshift queries with the normalized design (may be once every 30 seconds)
But the downside of this approach is that once I have the data into redshift, I will have to carry out table joins while performing real time analytics on my redshift data.
So I wish to know on a high level which approach would be better:
Have a single flat table but query it before issuing a request to kinesis on an event notification. There wont be any table joins while performing analytics.
Have 2 tables and query redshift less often. But perform a table join while displaying results using BI/analytical tools.
Which of these 2 do you think is a better option? Let us assume that I will use appropriate sort keys/distribution keys in either cases.
I'd definitely go with your second option, which involves JOINing with queries. That's what Amazon Redshift is good at doing (especially if you have your SORTKEY and DISTKEY set correctly).
Let the streaming data come into Redshift in the most efficient manner possible, then join when doing queries. You'll have a lot less queries that way.
Alternatively, you could run a regular job (eg hourly) to batch process the data into a wide table. It depends how quickly you'll need to query the data after loading.