What is considered a concurrent query in BigQuery?

We are running into quota limits for a small dataset (less than 1 GB) in BigQuery. Google Cloud gives us no indication of what queries are running on the backend, which makes it hard to tune the setup. We have a BigQuery dataset and a dashboard built in Data Studio that queries the dataset.
I've used relational databases like Oracle in the past and they have excellent tooling to diagnose issues. But with BigQuery, I feel like I am staring into the dark.
I'd appreciate any help/pointers you can give.

The concurrent queries limit refers to the number of queries that are executing simultaneously in BigQuery. The quota limit for on-demand, interactive queries is 100 concurrent queries (updated).
Based on this, it seems that your Data Studio dashboard is hitting this quota when running your reports, in which case the suggestion is to redesign the dashboard so it stays within those limits.
Additionally, you can use the bq ls -j -a PROJECTNAME command to list the jobs that have been run in your project in order to identify the queries you need to work with, as mentioned by Elliott Brossard.
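If you would rather inspect this from code than the bq CLI, here is a small sketch with the BigQuery Python client library that lists recent jobs so you can see what Data Studio is actually running; the project ID is a placeholder:

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # placeholder project ID

    # List the most recent jobs across all users in the project, so the
    # dashboard's queries show up alongside everything else. Pass
    # state_filter="running" to see only the queries executing right now.
    for job in client.list_jobs(max_results=50, all_users=True):
        print(job.job_id, job.job_type, job.state, job.user_email, job.created)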

Related

Online Spanner schema updates are extremely slow

Updating a Spanner schema online takes minutes even for very small tables (tens of rows),
e.g. adding/dropping/altering columns, adding tables, etc.
This can be quite frustrating for development processes and new version deployments.
Any plans for improvement?
A few more questions:
Does anyone know of a third-party schema comparison tool for Spanner? I couldn't find any.
What about data backups, in order to save historical snapshots?
Thanks in advance
Schema Updates:
Since Cloud Spanner is a distributed database, it has to propagate every schema change to all moving parts of the system, which accounts for the latency you describe.
As a suggestion, you could batch the schema updates. This keeps latency low (a batch costs roughly the same as a single schema update) and can be done through the API or the gcloud command-line tool.
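As an illustration, here is a minimal sketch of a batched schema update with the Python client library; the instance and database IDs and the DDL statements are placeholders:

    from google.cloud import spanner

    client = spanner.Client()
    database = client.instance("my-instance").database("my-db")  # placeholder IDs

    # Submit several DDL statements as one batch; Spanner applies them as a
    # single long-running schema-change operation.
    operation = database.update_ddl([
        "ALTER TABLE Users ADD COLUMN LastLogin TIMESTAMP",
        "ALTER TABLE Users ADD COLUMN Nickname STRING(64)",
        "CREATE INDEX UsersByNickname ON Users(Nickname)",
    ])
    operation.result()  # block until the schema change completes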
Schema Comparison Tool:
You could use the getDatabaseDdl API to maintain a history of your schema changes and use your tool of choice to diff them.
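For example, a rough sketch that dumps the current DDL with the Python database-admin client so successive dumps can be kept in version control and diffed; the project, instance, and database IDs are placeholders:

    from google.cloud import spanner_admin_database_v1

    admin = spanner_admin_database_v1.DatabaseAdminClient()
    name = admin.database_path("my-project", "my-instance", "my-db")  # placeholder IDs

    # getDatabaseDdl returns the CREATE statements for the current schema;
    # write them to a file and diff successive dumps with git or any diff tool.
    response = admin.get_database_ddl(database=name)
    with open("schema.sql", "w") as f:
        f.write(";\n\n".join(response.statements) + ";\n")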

AWS Redshift vs Snowflake use cases

I was wondering if anyone has used both AWS Redshift and Snowflake, and about use cases where one is better. I have used Redshift, but recently someone suggested Snowflake as a good alternative. My use case is basically retail marketing data that will be used by a handful of analysts who are not terribly SQL savvy and will most likely have a reporting tool on top.
Redshift is a good product, but it is hard to think of a use case where it is better than Snowflake. Here are some reasons why Snowflake is better:
Snowflake's admin console is brilliant; Redshift has none.
Scale-up/down happens in seconds to minutes; Redshift takes minutes to hours.
The documentation for both products is good, but Snowflake's is better laid out and more accessible.
You need to know less "secret sauce" to make Snowflake work well. On Redshift you need to know and understand the performance impacts of things like distribution keys and sort keys, at a minimum.
The load processes for Snowflake are more elegant than Redshift's. Redshift assumes that your data is already in S3. Snowflake supports S3, but also has extensions to JDBC, ODBC and dbAPI that really simplify and secure the ingestion process (see the sketch after this list).
Snowflake has great support for in-database JSON, and is rapidly enhancing its XML support. Redshift has a more complex approach to JSON, recommends against it for all but smaller use cases, and does not support XML.
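To illustrate the ingestion point above, here is a hedged sketch using the Snowflake Python connector to stage a local file and load it; the connection parameters, table, and file names are made up, and the CAMPAIGNS table is assumed to already exist with matching columns:

    import snowflake.connector

    # Connection parameters are placeholders.
    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="...",
        warehouse="LOAD_WH", database="MARKETING", schema="PUBLIC",
    )
    cur = conn.cursor()

    # Stage a local CSV directly from the client; no S3 bucket required.
    cur.execute("PUT file:///tmp/campaigns.csv @%CAMPAIGNS AUTO_COMPRESS=TRUE")

    # Load the staged file from the table stage into the table.
    cur.execute("""
        COPY INTO CAMPAIGNS
        FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1)
    """)
    conn.close()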
I can only think of two cases which Redshift wins hands-down. One is geographic availability, as Redshift is available in far more locations than Snowflake, which can make a difference in data transfer and statement submission times. The other is the ability to submit a batch of multiple statements. Snowflake can only accept one statement at a time, and that can slow down your batches if they comprise many statements, especially if you are on another continent from your server.
At Ajilius our developers use Redshift, Snowflake and Azure SQL Data Warehouse on a daily basis; and we have customers on all three platforms. Even with that choice, every developer prefers Snowflake as their go-to cloud DW.
I evaluated both Redshift (Redshift Spectrum with S3) and Snowflake.
In my POC, Snowflake was far better than Redshift. Snowflake integrates well with relational and NoSQL data. No upfront index or partition key is required, and it works well without you having to worry about how the data will be accessed.
Redshift is much more limited and has no real JSON support. Partitioning is hard to understand, and you have to do a lot of work to get things done. You can use Redshift Spectrum as a band-aid to access S3, but good luck with deciding your partitioning upfront: once you have created a partition layout in your S3 bucket you are stuck with it, and there is no way to change it unless you reprocess all the data into a new structure. You end up spending time fixing these issues instead of working on real business problems.
It's like comparing a smartphone to a Morse code machine. Redshift is the Morse code side of that comparison, and it is not built for modern development.
We recently switched from Redshift to Snowflake for the following reasons:
Real-time data syncing
Handling of concurrent queries
Minimizing of database administration
Providing different amounts of computing power to different Looker users
A more in-depth writeup can be found on our data blog.
I evaluated Redshift and Snowflake, and a little bit of Athena and Spectrum as well. The latter two were non-starters in cases where we had big joins, as they would run out of memory. With Redshift, I could actually get a better price-to-performance ratio, for a couple of reasons:
it allows me to choose a distribution key, which is huge for co-located joins (see the sketch below)
it allows for extreme discounts on three-year reserved pricing, so much so that you can really upsize your compute at a reasonable cost
I could get better performance in most cases with Redshift, but it requires good MPP knowledge to setup the physical schema properly. The cost of expertise and complexity offsets some of the product cost.
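For example, here is a minimal sketch of the kind of physical-design decision Redshift expects up front, run through psycopg2; the cluster endpoint, table, and column names are made up:

    import psycopg2

    # Connection details are placeholders.
    conn = psycopg2.connect(
        host="my-cluster.example.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="admin", password="...",
    )
    with conn, conn.cursor() as cur:
        # Distribution and sort keys must be chosen when the table is created;
        # they determine whether joins are co-located and how much data is scanned.
        cur.execute("""
            CREATE TABLE sales (
                sale_id     BIGINT,
                customer_id BIGINT,
                sale_date   DATE,
                amount      DECIMAL(12,2)
            )
            DISTSTYLE KEY
            DISTKEY (customer_id)  -- joins on customer_id stay node-local
            SORTKEY (sale_date);   -- range filters on date prune blocks
        """)
    conn.close()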
Redshift stores JSON in a VARCHAR column. That can cause problems (OOM) when querying a subset of JSON elements across large tables, where the VARCHAR column is sized too big. In our case we had to define the VARCHAR as extremely large to accommodate a few records that had very large JSON documents.
Snowflake functionality is amazing, including:
ability to clone objects
deep functionality in handling JSON data (see the sketch after this list)
snowpipe for low maintenance loading, auto scaling loads, trickle updates
streams & tasks for home grown ETL
ability to scale storage and compute separately
ability to scale compute within a minute, requiring no data migration
and many more
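As a rough sketch of the JSON and cloning points from this list (and a contrast with the VARCHAR approach described above), using the Snowflake Python connector; the connection parameters are placeholders and RAW_EVENTS is assumed to be a table with a VARIANT column named src:

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="...",
        warehouse="ANALYTICS_WH", database="ANALYTICS", schema="PUBLIC",
    )
    cur = conn.cursor()

    # JSON lives in a VARIANT column and is queried with path syntax;
    # no up-front sizing of a text column is needed.
    cur.execute("""
        SELECT src:customer.name::string AS customer_name,
               src:purchase.total::number AS purchase_total
        FROM   RAW_EVENTS
        WHERE  src:purchase.status::string = 'SHIPPED'
    """)
    print(cur.fetchmany(10))

    # Zero-copy clone: an instant, writable copy that shares storage
    # with the source until either side changes.
    cur.execute("CREATE TABLE RAW_EVENTS_DEV CLONE RAW_EVENTS")
    conn.close()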
One thing that I would caution about Snowflake is that one might be tempted to hire less skilled developers/DBAs to run the system. Performance in a bad schema design can be worked around using a huge compute cluster, but that may not be the best bang for the buck. Regardless, the functionality in Snowflake is amazing.

Cloud Spanner query performance regression

We've noticed that some of our queries have seen degraded performance in the last couple of weeks. We suspect this is due to some combination of:
Increased data in the tables
Increased data in some results
Inefficient or over-aggressive use of transactions
Any advice on how to diagnose the performance of a particular query?
When running an interactive query against your database in the Google Cloud Platform online management console, you can request generation of a plan explanation with the tab below the 'Run Query' button. This explanation may help you understand why your query is running slowly.
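If you want the same information programmatically, here is a rough sketch with the Python client library; the instance, database, and table names are placeholders, and it assumes your client version exposes the query_mode parameter on execute_sql:

    from google.cloud import spanner
    from google.cloud.spanner_v1 import ExecuteSqlRequest

    client = spanner.Client()
    database = client.instance("my-instance").database("my-db")  # placeholder IDs

    with database.snapshot() as snapshot:
        # Ask Spanner for the query plan only; no rows are returned.
        results = snapshot.execute_sql(
            "SELECT o.OrderId FROM Orders AS o WHERE o.CustomerId = 42",
            query_mode=ExecuteSqlRequest.QueryMode.PLAN,
        )
        list(results)                    # drain the (empty) stream to populate stats
        print(results.stats.query_plan)  # the same plan the console explanation shows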
One common reason for performance regressions is that you have recently deleted or updated a lot of data. It can take several days for deleted/overwritten data to be garbage-collected, and in the interim it can slow down operations since this old data must still be scanned for queries over its key-range.

Cloud SQL Migrate from 1st to 2nd generation

We currently have a Cloud SQL instance with about 600 databases (10 GB total) and we have had several problems with the instance crashing, so we are thinking about moving to a 2nd-generation instance. However, I have found no tool in the console that does this.
Is there some way to do this other than exporting everything as SQL and then executing all queries in the new instance?
And as a side note, is there some limit to the number of databases per instance? I have found no information on how many databases are recommended to avoid performance and reliability issues.
Thank you
Export and import is the way to do it currently.
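For reference, here is a hedged sketch of driving the export through the Cloud SQL Admin API from Python; the project, instance, bucket, and database names are placeholders, and the import into the new 2nd-generation instance is the mirror operation via instances().import_():

    from googleapiclient import discovery

    # Uses application-default credentials; all names below are placeholders.
    sqladmin = discovery.build("sqladmin", "v1beta4")

    operation = sqladmin.instances().export(
        project="my-project",
        instance="my-first-gen-instance",
        body={
            "exportContext": {
                "fileType": "SQL",
                "uri": "gs://my-bucket/first-gen-dump.sql",
                "databases": ["db_001"],  # loop over all 600 databases as needed
            }
        },
    ).execute()
    print(operation["name"], operation["status"])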
Google Cloud SQL uses practically unmodified MySQL binaries, so you can find the limits in the MySQL doc. This one is for 5.6: https://dev.mysql.com/doc/refman/5.6/en/database-count-limit.html
The underlying OS, however, is a custom variant of Linux, and its limits are not documented at this point, but you are probably doing something wrong if you are hitting the limits of the OS.

Are Azure WebJobs appropriate for importing large amounts of data?

I am working on an application where we receive CSV files from a government department with approx. 1.5 million rows, monthly. We have to get this data into Azure Table Storage. We are trying to avoid having to provision VMs for this and are wondering if WebJobs are a good choice for such a large dataset?
Thanks.
Yes, they should work. WebJobs are nothing more than a process running on the website's machine.
You'll probably want to turn on the "Always On" feature if your WebJob will take a long time to complete.
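For the ingestion itself, here is a hedged sketch of what such a WebJob could run, using the azure-data-tables package to batch rows into Table Storage; the connection string, table name, file name, and CSV columns are assumptions:

    import csv
    from collections import defaultdict
    from azure.data.tables import TableServiceClient

    CONN_STR = "..."  # placeholder storage-account connection string

    service = TableServiceClient.from_connection_string(CONN_STR)
    table = service.create_table_if_not_exists("MonthlyImport")  # placeholder table

    # Group rows by partition key first: a Table Storage transaction holds at
    # most 100 operations and all of them must share one PartitionKey.
    groups = defaultdict(list)
    with open("govt_export.csv", newline="") as f:  # placeholder file name
        for row in csv.DictReader(f):
            entity = dict(row)
            entity["PartitionKey"] = row["region"]   # assumed CSV column
            entity["RowKey"] = row["record_id"]      # assumed CSV column
            groups[entity["PartitionKey"]].append(entity)

    for entities in groups.values():
        for i in range(0, len(entities), 100):
            table.submit_transaction([("upsert", e) for e in entities[i:i + 100]])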