Cloud SQL to BigQuery ETL tool - google-cloud-platform

I have a Cloud SQL instance with hundreds of databases, one for each customer. Each database has the same tables in it, but data only for the specific customer.
What I want to do with it, is transform in various ways so to get an overview table with all of the customers. Unfortunately, I cannot seem to find a tool that can iterate over all the databases a Cloud SQL instance has, execute queries and then write that data to BigQuery.
I was really hoping that Dataflow would be the solution but as far as I have tried and looked online, I cannot find a way to make it work. Since I spent a lot of time already on investigating Dataflow, I thought it might be best to ask here.
Currently I am looking at Data Fusion, Datastream, Apache Airflow.
Any suggestions?

Why Dataflow doesn't fit your needs? You could run a query to find out the tables, and then iteratively build the Pipeline/JdbcIO sources/PCollections based on those results.
Beam has a Flatten transform that can join PCollections.
What you are trying to do is one of the use cases why Dataflow Flex Templates was created (to have dynamic DAG creation within Dataflow itself) but that can be pulled without Flex Templates as well.

Airflow can be used for this sort of thing (essentially, you're doing the same task over and over, so with an appropriate operator and a for-loop you can certainly generate a DAG with hundreds of near-identical tasks that export each of your databases).
However, I'd be remiss not to ask: should you?
There may be a really excellent reason why you've created hundreds of databases in one instance, rather than one database with a customer field on each table. Yet if security is paramount, a row level security policy could add an additional element of safety without putting you in this difficult situation. Adding an index over the customer field would allow you to retrieve the appropriate sub-table swiftly (in return for a small speed cost when inserting new rows) so performance also doesn't seem like a reason to do this.
Given that it would then be pretty straightforward to get your data into BigQuery I would be moving heaven and earth to switch over to this setup, if I were you!

Related

Simple Search capabilities with NoSql (DynamoDB)

I am new to NoSQL. I am trying to make simple app which will have products that you search through. With SQL I would simply have a products table and be able to search any of the columns for substrings with %LIKE% and pull the returned rows. I would like to use DynamoDB, but seemingly there is no way of doing this without introducing AWS OpenSearch (ElasticSearch) which will probably cost more than all my DynamoDb tables. Is there any simple way to do this in DynamoDb without having to scan the whole table and filtering with contains?
No, there is no way to do what you want (search dynamodb) without adding in another layer such as elasticsearch - keep it simple, use a traditional database.
IMO, never assume you need a nosql database - because you rarely do - always assume you need a traditional database until proven otherwise.
Ok so DynamoDB is not what you are looking for, it is designed for a very different use case.
However, ElasticSearch which is in no way tied to DynamoDB very much is what you are looking for and will greatly simplify what you are trying to over using a traditional SQL database. Those who are saying otherwise, are providing poor information. A traditional database cannot index a %LIKE% query, where this is precisely what ElasticSearch does on every field in your document.
Getting started with ElasticSearch is super easy. Just download the Jar and run it, then start going through examples posting and getting documents from the index. If your experience is anything like mine, you will never really want to use a SQL database again, but as is mentioned they each have their own place, and so I do still use traditional RDBMS but I specialize in ElasticSearch.
I have converted many applications that were unable to find reasonable performance, to ElasticSearch where the performance is almost always sub second, and typically a fraction of that. An RDBMS being asked to do many %LIKE% matches will not be able to provide you sub second results.
There are also a number of tools that will automatically funnel data from your RDBMS db into ElasticSearch so that you can have the benefits of both worlds.
NoSQL means a great many things. In general it has been applied to at several classes of datastore.
Columnar Datastore - DynamoDB, Hive
Document/Object Database - MongoDB, CouchDB, MarkLogic, and a great many others
Key/Value - Cassandra, MongoDB, Redis, Memcache
Search Index - SOLR, ElasticSearch, MarkLogic
ElasticSearch bridges the gap between Document Database and Search Index, providing the features of both. It also provides the capabilities of a Key/Value data store.
The columnar datastore is much more tuned for doing work across massive amounts of data, generally in an aggregate, but results from the queries are not the kind of performance you are looking for. These are used for datasets with trillions of rows and hundreds of features/columns.
ElasticSearch however provides very rapid search across large numbers of JSON documents index by default every value in the json.
The way to do this with dynamodb is by using ElasticSearch, however, you do not need DynamoDB to do this with ElasticSearch, so you don't need double the cost.

How can I implement Amazon EMR to read data from my API calls?

All the examples i've seen are with Java programs?
I want to be able to track the a user's behaviour while navigating my website by looking at all the API calls made by that user. All the API calls are based on data stored in a SQL database.
I also for example want to check all the keywords passed to my search API to have a list of most search terms.
I thought about using Oozie but does anyone have any other suggestions ?
There are several option for analyzing the data in your database.
Normal SQL experimentation
I'd suggest starting with normal SQL statements against your database to experiment with finding what data is of interest. This might be a little slow if you have millions of records, but gives you full flexibility to play around with the data.
Amazon EMR
Once you have identified the types of analysis you'd like to run on a regular basis (eg daily or weekly), you could launch an EMR cluster to perform analysis. Please note that this is a powerful but rather complex toolset and the time required to fully utilize it might not be worthwhile.
You can launch a transient cluster, which means that the cluster terminates once it has finished the jobs it has been given. Thus, the cluster can be triggered via a scheduled API call and will automatically terminate.
Amazon Athena
Amazon Athena provides an SQL interface to data stored in Amazon S3. The common use-case is to analyze log files that are in S3 without having to load them into a database. Athena is powerful and processes data in parallel to give results back very quickly.
Bottom line: Start simple. Play with the existing data to figure out what you'd like to discover. Then optimize.

Online update spanner schema is extremely slow

Online updating spanner schema takes minutes even for very very small tables (10s of rows).
i.e. - adding/dropping/altering columns, adding tables, etc.
This can be quite frustrating for development processes and new version deployments.
Any plans for improvement?
Few more questions:
Anyone knows a 3rd party schema comparison tool for spanner? couldn't find any.
What about data backups? in order to save historical snapshots.
Thanks in advance
Schema Updates:
Since Cloud Spanner is a distributed database, it makes sure to update all moving parts of the system which takes the latency as described.
As a suggestion, you could batch the schema updates. This ensures the lower latencies (nearly equivalent to executing a single schema update) and can be executed using API / gcloud command-line tools.
Schema Comparison Tool:
You could use the getDatabaseDdl API to maintain history of your schema changes and use your tool of choice to diff them.

AWS Redshift vs Snowflake use cases

I was wondering if anyone has used both AWS Redshift and Snowflake and use cases where one is better . I have used Redshift but recently someone suggested Snowflake as a good alternative . My use case is basically retail marketing data that will be used by handful of analysts who are not terribly SQL savvy and will most likely have reporting tool on top
Redshift is a good product, but it is hard to think of a use case where it is better than Snowflake. Here are some reasons why Snowflake is better:
The admin console is brilliant, Redshift has none.
Scale-up/down happens in seconds to minutes, Redshift takes minutes to hours.
The documentation for both products is good, but Snowflake is better laid
out and more accessible.
You need to know less "secret sauce" to make Snowflake work well. On Redshift you need to know and understand the performance impacts of things like distribution keys and sort keys, at a minimum.
The load processes for Snowflake are more elegant than Redshift. Redshift assumes that your data is in S3 already. Snowflake supports S3, but has extensions to JDBC, ODBC and dbAPI that really simplify and secure the ingestion process.
Snowflake has great support for in-database JSON, and is rapidly enhancing its XML. Redshift has a more complex approach to JSON, and recommends against it for all but smaller use cases, and does not support XML.
I can only think of two cases which Redshift wins hands-down. One is geographic availability, as Redshift is available in far more locations than Snowflake, which can make a difference in data transfer and statement submission times. The other is the ability to submit a batch of multiple statements. Snowflake can only accept one statement at a time, and that can slow down your batches if they comprise many statements, especially if you are on another continent to your server.
At Ajilius our developers use Redshift, Snowflake and Azure SQL Data Warehouse on a daily basis; and we have customers on all three platforms. Even with that choice, every developer prefers Snowflake as their go-to cloud DW.
I evaluated both Redshift(Redshfit spectrum with S3) and SnowFlake.
In my poc, snowFlake is way way better than Redshift. SnowFlake integrates well with Relational/NOSQL data. No upfront index or partition key required. It works amazing without worrying about what way to access the day.
Redshift is very limited and no json support. Its hard to understand the partition. You have to do lot of work to get something done. No json support. You can use redshift specturm as a bandaid to access S3. Good luck with partioning upfront. Once you created partition in S3 bucket, you are done with that and no way to change until unless you redo process all data again to new structue. You will end up sending time to fix these issues instead of working on fixing real business problems.
Its like comparing Smartphone vs Morse code mechine. Redshift is like morse code kind of implementation and its not for mordern development
We recently switched from Redshift to Snowflake for the following reasons:
Real-time data syncing
Handling of concurrent queries
Minimizing of database administration
Providing different amounts of computing power to different Looker users
A more in-depth writeup can be found on our data blog.
I evaluated Redshift and Snowflake, and a little bit of Athena and Spectrum as well. The latter two were non-starters in cases where we had big joins, as they would run out of memory. For Redshift, I could actually get a better price to performance ratio for a couple reasons:
allows me to choose a distribution key which is huge for co-located joins
allows for extreme discounts on three year reserved pricing, so much so that you can really upsize your compute at a reasonable cost
I could get better performance in most cases with Redshift, but it requires good MPP knowledge to setup the physical schema properly. The cost of expertise and complexity offsets some of the product cost.
Redshift stores JSON in a VARCHAR column. That can cause problems (OOM) when querying a subset of JSON elements across large tables, where the VARCHAR column is sized too big. In our case we had to define the VARCHAR as extremely large to accommodate a few records that had very large JSON documents.
Snowflake functionality is amazing, including:
ability to clone objects
deep functionality in handling JSON data
snowpipe for low maintenance loading, auto scaling loads, trickle updates
streams & tasks for home grown ETL
ability to scale storage and compute separately
ability to scale compute within a minute, requiring no data migration
and many more
One thing that I would caution about Snowflake is that one might be tempted to hire less skilled developers/DBAs to run the system. Performance in a bad schema design can be worked around using a huge compute cluster, but that may not be the best bang for the buck. Regardless, the functionality in Snowflake is amazing.

QUICKLY exporting a dynamo DB table to S3

So I want to dump an entire DynamoDB table to S3. This tutorial gives a good explanation of how to do so. Gave it a test, it worked...great
However now I want to use it on my production data which is sizeable (>100GB). And I want it to run quickly. Obviously the read throughput on my DynamoDB table is a factor here, but is there a way to make sure the data pipeline is doing everything it can. I'm not super familiar with these, the architect view after the setup has areas for instance types and instance count, but will increasing these decrease my pipeline time? The tutorial doesn't mention anything about speed except for specifying the throughput of the table you meant to use. Will it scale automatically based upon that?
The template is based on the open source samples that the datapipeline team has on gihub.
The template you are referring to is here .
If you take a look at the pipeline definition, you will find that the export is being done via a map-reduce job. The scalability of the export job should be handled by that.
If you need to get a few more details on how EMR works with DynamoDB, you will find it at the here. If you increase the number of instances you would need to adjust the throughput of your table accordingly to increase the parallelism of the export.