Backup/Remove Datastore entities to Big Query Automatically at some interval - python-2.7

I am looking to stream some data into Big Query and had a question around Step 3 of Google's best practices for Streaming data into Big Query. The process makes sense at a high level but I'm struggling with the implementation for step 3. (I am looking to use the datastore as my transactional data store.) For step 3 says to "reconciled data from the transactional data store and truncate the unreconciled data table.". My question is this; If my reconciled data is in the Google Datastore is there a way to automate the backup and deletion of this data without manually intervention?
I know I could achieve this recommended practice by using the Datastore Admin. I could:
1) Pause all writes to the datastore
2) Backup the datastore table to Cloud Storage
3) Delete all entities in the table I just backed up.
4) Import the backup into Big Query
Is there a way I can automate this so I don't have to do it manually at regular intervals?
Real-time dashboards and queries
In certain situations, streaming data into BigQuery enables real-time analysis over transactional data. Since streaming data comes with a possibility of duplicated data, ensure that you have a primary, transactional data store outside of BigQuery.
You can take a few precautions to ensure that you'll be able to perform analysis over transactional data, and also have an up-to-the-second view of your data:
1) Create two tables with an identical schema. The first table is for the reconciled data, and the second table is for the real-time, unreconciled data.
2) On the client side, maintain a transactional data store for records.
Fire-and-forget insertAll() requests for these records. The insertAll() request should specify the real-time, unreconciled table as the destination table.
3) At some interval, append the reconciled data from the transactional data store and truncate the unreconciled data table.
4) For real-time dashboards and queries, you can select data from both tables. The unreconciled data table might include duplicates or dropped records.

Related

Schema design for Google BigTable

In my project, Im using Google BigQuery that holds loots of data.
The BigQuery columns are:
account_id, session_id, transaction_id, username, event, timestamp.
In my dashboard, Im fetching the entire data based on time stamp (last 30 days).
Since I have very large data, the performance are pretty slow (13 sec to fetch the last 30 days data).
Lately, I try to look on Google BigTable and I saw they have an option to get data based on time.
In my tests, the performance of the BigTable are slower from the BigQuery.
Is any suggested schema that can improve the performance with BigTable?
This is example to my schema in BigTable:
const row = {
key: `transactions#${timestamp_micros}`,
data: {
identifiers: {
session_id: `session_id-${startCounter}`,
account_id: `acount-${startCounter}`,
device_id: `device-${startCounter}`,
transaction_id: `transaction_id-${startCounter}`,
runtime_id: 'AQW+2Xx5AQAAstvxskK0c8NTk+vP5eBM',
page_id: `page_id-${startCounter}`,
start_time: timestamp,
},
},
};
Is anyone can suggest a better schema that will help me to fetch the data (based on timestamp range) with the best performance?
A good schema results in excellent performance and scalability, and a bad schema can lead to a poorly performing system. However, no single schema design provides the best fit for all use cases and hence your question is opinionated and will vary from person to person. The patterns described on this page provide a starting point to decide a schema for BigTable. Your unique dataset and the queries you plan to use are the most important things to consider as you design a schema for your time-series data.
As you've discovered from our docs, the row key format is the biggest decision you make when using Bigtable, as it determines which access patterns can be performed efficiently. Having row key transaction_id#reverse_timestampgets your data sorted from the latest timestamp. This could avoid hotspotting issues, which is one of the big reasons for slow query results.
However, you're also coming from a SQL architecture, which isn't always a good fit for Bigtable's schema/query model. So here are some questions to get you started:
Are you planning to perform lots of ad hoc queries like "SELECT A
FROM Bigtable WHERE B=x"? If so, strongly prefer BigQuery. Bigtable
can't support this query without performing a full table scan. (hence
it is slower than BigQuery)
Will you require multi-row OLTP transactions? Again, use BigQuery, as
Bigtable only supports transactions within a single row.
Are you streaming in new events at high QPS? Bigtable is much better
for these sorts of high-volume updates. Do you want to perform any
sort of large-scale complex transformations on the data? Again,
Bigtable is likely better here, as you can stream data out and back
in faster.
You can also combine the two services if you need some combination of these features. For example, say you're receiving high-volume updates all the time, but want to be able to perform complex ad hoc queries. If you're alright working with a slightly delayed version of the data, it could make sense to write the updates to Bigtable, then periodically scan the table using Dataflow and export a post-processed version of the latest events into BigQuery. GCP also allows BigQuery to serve queries directly from Bigtable in a some regions: https://cloud.google.com/bigquery/external-data-bigtable
My personal choice for your use case is Big Query. You can leverage the pruning in Big Query where BigQuery scans the partitions that match the filter and skip the remaining partitions. Not only does it make it easier to manage and query your data. By dividing a large table into smaller partitions, you can improve query performance, and you can control costs by reducing the number of bytes read by a query. You can use time-unit column partitioning or ingestion time partitioning. When you create a table partitioned by ingestion time, BigQuery automatically assigns rows to partitions based on the time when BigQuery ingests the data. You can choose hourly, daily, monthly, or yearly granularity for the partitions.
So your query for fetching the entire data based on timestamp (last 30 days) should be something like this in BigQuery (when used partitioning):
SELECT
column
FROM
dataset.table
WHERE
_PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01') AND TIMESTAMP('2016-01-02')

Optimal Big Data solution for aggregating time-series data and storing results to DynamoDB

I am looking into different Big Data solutions and have not been able to find a clear answer or documentation on what might be the best approach and frameworks/services to use to address my Big Data use-case.
My Use-case:
I have a data producer that will be sending ~1-2 billion events to a
Kinesis Data Firehose delivery stream daily.
This data needs to be stored in some data lake / data warehouse, aggregated, and then
loaded into DynamoDB for our service to consume the aggregated data
in its business logic.
The DynamoDB table needs to be updated hourly. (hourly is not a hard requirement but we would like DynamoDB to be updated as soon as possible, at the longest intervals of daily updates if required)
The event schema is similar to: customerId, deviceId, countryCode, timestamp
The aggregated schema is similar to: customerId, deviceId, countryCode (the aggregation is on the customerId's/deviceId's MAX(countryCode) for each day over the last 29 days, and then the MAX(countryCode) overall over the last 29 days.
Only the CustomerIds/deviceIds that had their countryCode change from the last aggregation (from an hour ago) should be written to DynamoDB to keep required write capacity units low.
The raw data stored in the data lake / data warehouse needs to be deleted after 30 days.
My proposed solution:
Kinesis Data Firehose delivers the data to a Redshift staging table (by default using S3 as intermediate storage and then using the COPY command to load to Redshift)
An hourly Glue job that:
Drops the 30 day old time-series table and creates a new time-series table for today in Redshift if this is the first job run of a new day
Loads data from staging table to the appropriate time-series table
Creates a view on top of the last 29 days of time-series tables
Aggregates by customerId, deviceId, date, and MAX(CountryCode)
Then aggregates by customerId, deviceId, MAX(countryCode)
Writes the aggregated results to an S3 bucket
Checks the previous hourly Glue job's run aggregated results vs. the current runs aggregated results to find the customerIds/deviceIds that had their countryCode change
Writes the customerIds/deviceIds rows that had their countryCode change to DynamoDB
My questions:
Is Redshift the best storage choice here? I was also considering using S3 as storage and directly querying data from S3 using a Glue job, though I like the idea of a fully-managed data warehouse.
Since our data has a fixed retention period of 30 days, AWS documentation: https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-time-series-tables.html suggests to use time-series tables and running DROP TABLE on older data that needs to be deleted. Are there other approaches (outside of Redshift) that would make the data lifecycle management easier? Having the staging table, creating and loading into new time-series tables, dropping older time-series tables, updating the view to include the new time-series table and not the one that was dropped could be error prone.
What would be an optimal way to find the the rows (customerId/deviceId combinations) that had their countryCode change since the last aggregation? I was thinking the Glue job could create a table from the previous runs aggregated results S3 file and another table from the current runs aggregated results S3 file, run some variation of a FULL OUTER JOIN to find the rows that have different countryCodes. Is there a better approach here that I'm not aware of?
I am a newbie when it comes to Big Data and Big Data solutions so any and all input is appreciated!
tldr: Use step functions, not Glue. Use Redshift Spectrum with data in S3. Otherwise you overall structure looks on track.
You are on the right track IMHO but there are a few things that could be better. Redshift is great for sifting through tons of data and performing analytics on it. However I'm not sure you want to COPY the data into Redshift if all you are doing is building aggregates to be loaded into DDB. Do you have other analytic workloads being done that will justify storing the data in Redshift? Are there heavy transforms being done between the staging table and the time series event tables? If not you may want to make the time series tables external - read directly from S3 using Redshift Spectrum. This could be a big win as the initial data grouping and aggregating is done in the Spectrum layer in S3. This way the raw data doesn't have to be moved.
Next I would advise not using Glue unless you have a need (transform) that cannot easily be done elsewhere. I find Glue to require some expertise to get to do what you want and it sounds like you would just be using it for a data movement orchestrator. If this impression is correct you will be better off with a step function or even a data pipeline. (I've wasted way too much time trying to get Glue to do simple things. It's a powerful tool but make sure you'll get value from the time you will spend on it.)
If you are only using Redshift to do these aggregations and you go the Spectrum route above you will want to get as small a cluster as you can get away with. Redshift can be pricy and if you don't use its power, not cost effective. In this case you can run the cluster only as needed but Redshift boot up times are not fast and the smallest clusters are not expensive. So this is a possibility but only in the right circumstances. Depending on how difficult the aggregation is that you are doing you might want to look at Athena. If you are just running a few aggregating queries per hour then this could be the most cost effective approach.
Checking against the last hour's aggregations is just a matter of comparing the new aggregates against the old which are in S3. This is easily done with Redshift Spectrum or Athena as they can makes files (or sets of files) the source for a table. Then it is just running the queries.
In my opinion Glue is an ETL tool that can do high power transforms. It can do a lot of things but is not my first (or second) choice. It is touchy, requires a lot of configuration to do more than the basics, and requires expertise that many data groups don't have. If you are a Glue expert, knock you self out; If not, I would avoid.
As for data management, yes you don't want to be deleting tons of rows from the beginning of tables in Redshift. It creates a lot of data reorganization work. So storing your data in "month" tables and using a view is the right way to go in Redshift. Dropping tables doesn't create this housekeeping. That said if you organize you data in S3 in "month" folders then unneeded removing months of data can just be deleting these folders.
As for finding changing country codes this should be easy to do in SQL. Since you are comparing aggregate data to aggregate data this shouldn't be expensive either. Again Redshift Spectrum or Athena are tools that allow you to do this on S3 data.
As for being a big data newbie, not a worry, we all started there. The biggest difference from other areas is how important it is to move the data the fewest number of times. It sounds like you understand this when you say "Is Redshift the best storage choice here?". You seem to be recognizing the importance of where the data resides wrt the compute elements which is on target. If you need the horsepower of Redshift and will be accessing the data over and over again then the Redshift is the best option - The data is moved once to a place where the analytics need to run. However, Redshift is an expensive storage solution - it's not what it is meant to do. Redshift Spectrum is very interesting in that the initial aggregations of data is done in S3 and much reduced partial results are sent to Redshift for completion. S3 is a much cheaper storage solution and if your workload can be pattern-matched to Spectrum's capabilities this can be a clear winner.
I want to be clear that you have only described on area where you need a solution and I'm assuming that you don't have other needs for a Redshift cluster operating on the same data. This would change the optimization point.

How best cache bigquery table for fast lookup of individual row?

I have a raw data table in bigquery that has hundreds of millions of rows. I run a scheduled query every 24 hours to produce some aggregations that results a table in the ballmark of 33 million rows (6gb) but may be expected to grow slowly to approximately double its current size.
I need a way to get 1 row at a time quick access lookup by id to that aggregate table in a separate event driven pipeline. i.e. A process is notified that person A just took an action, what do we know about this person's history from the aggregation table?
Clearly bigquery is the right tool to produce the aggregate table, but not the right tool for the quick lookups. So I need to offset it to a secondary datastore like firestore. But what is the best process to do so?
I can envision a couple strategies:
1) Schedule a dump of agg table to GCS. Kick off a dataflow job to stream contents of gcs dump to pubsub. Create a serverless function to listen to pubsub topic and insert rows into firestore.
2) A long running script on compute engine which just streams the table directly from BQ and runs inserts. (Seems slower than strategy 1)
3) Schedule a dump of agg table to GCS. Format it in such a way that can be directly imported to firestore via gcloud beta firestore import gs://[BUCKET_NAME]/[EXPORT_PREFIX]/
4) Maybe some kind of dataflow job that performs lookups directly against the bigquery table? Not played with this approach before. No idea how costly / performant.
5) some other option I've not considered?
The ideal solution would allow me quick access in milliseconds to an agg row which would allow me to append data to the real time event.
Is there a clear best winner here in the strategy I should persue?
Remember that you could also CLUSTER your table by id - making your lookup queries way faster and less data consuming. They will still take more than a second to run though.
https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
You could also set up exports from BigQuery to CloudSQL, for subsecond results:
https://medium.com/#gabidavila/how-to-serve-bigquery-results-from-mysql-with-cloud-sql-b7ddacc99299
And remember, now BigQuery can read straight out of CloudSQL if you'd like it to be your source of truth for "hot-data":
https://medium.com/google-cloud/loading-mysql-backup-files-into-bigquery-straight-from-cloud-sql-d40a98281229

Best strategy for building Redshift Data Warehouse from multiple DBs

I need some guidance on our strategy for loading data into a Redshift Data Warehouse for analytics. We have ~40 SQL databases, each represents one customer and each database is identical. I have a SQL database with the same table structure as the 40 but each table has an additional column called "customer" that will capture where that record came from. We do some additional ETL processing with the records as they come in.
In total we have about 50 GB of data across all 40 DBs. Looking into the recommended processes for Updating / Inserting data on AWS's site they recommend creating the scratch table then merging data. I could do this but I could also just drop all the data from a table and re-load it since I am reading from the source every time. What is the recommended way to handle this?

Redshift performance: SQL queries vs table normalization

I'm working on building a redshift database by listening to events from from different sources and pump that data into a redshift cluster.
The idea is to use Kinesis firehose to pump data to redshift using COPY command. But I have a dilemma here: I wish to first query some information from redshift using a select query such as the one below:
select A, B, C from redshift__table where D='x' and E = 'y';
After getting the required information from redshift, I will combine that information with my event notification data and issue a request to kinesis. Kinesis will then do its job and issue the required COPY command.
Now my question is that is it a good idea to repeatedly query redshift like say every second since that is the expected time after which I will get event notifications?
Now let me describe an alternate scenario:
If I normalize my table and separate out some fields into a separate table then, I will have to perform fewer redshift queries with the normalized design (may be once every 30 seconds)
But the downside of this approach is that once I have the data into redshift, I will have to carry out table joins while performing real time analytics on my redshift data.
So I wish to know on a high level which approach would be better:
Have a single flat table but query it before issuing a request to kinesis on an event notification. There wont be any table joins while performing analytics.
Have 2 tables and query redshift less often. But perform a table join while displaying results using BI/analytical tools.
Which of these 2 do you think is a better option? Let us assume that I will use appropriate sort keys/distribution keys in either cases.
I'd definitely go with your second option, which involves JOINing with queries. That's what Amazon Redshift is good at doing (especially if you have your SORTKEY and DISTKEY set correctly).
Let the streaming data come into Redshift in the most efficient manner possible, then join when doing queries. You'll have a lot less queries that way.
Alternatively, you could run a regular job (eg hourly) to batch process the data into a wide table. It depends how quickly you'll need to query the data after loading.