Azure SQL DW DWC Unit Comparison / Equivalent with vCores, Memory - azure-sqldw

I'm trying to establish what SQL DW DWCs are equivalent to in terms of vCores and memory.
Basically, I'm trying to make an argument to my employers that the compute capacity currently deployed on SQL DW (see image below) is so minimal that they would get better compute performance if they used a physical server or SQL DB. However, I don't know what the equivalent of DW100c is in terms of vCores, memory, etc.
For example, if I wanted to purchase a physical server with the same compute power as that shown in the image, what would I need to ask for in terms of vCores, IOPS, memory, etc.?

We don't publish physical hardware specs, because we use DWUs to smooth the differences between different generations of hardware.
Any DWU scale less than 500 should really be considered a dev/test platform. You're on shared infrastructure, and the benefits of MPP for which you are paying won't be achieved.
But once you get to DWU1000c and above, you're getting huge performance gains over SMP SQL Server, Azure SQL, etc. 1000c is the lower end of performance, but I recall one exercise I was involved in last year where it was 30x faster than Azure SQL and SQL MI for the workload in question.
If you have an OLTP workload, or a mixed workload, use SMP. If you have a DW workload and your performance objectives are met by a DWU100c configuration then you are probably better off using Azure SQL, because your concurrency will be much better. At multi-terabyte scale, however, MPP will always be faster.

Related

how to test data load performance in Power BI

Report authoring in Power BI is done in Power BI Desktop, which is installed on users' workstations. Report sharing in Power BI is done in the Power BI cloud service (either shared or dedicated capacity). This means that different resources (i.e., memory, CPU, disk) are available during report authoring and report sharing, particularly for data load (dataset refresh). So, it seems impossible to test a report's data load / ETL performance prior to releasing to production (i.e., publishing to the cloud service). And, usually, data load performance is faster in the cloud service than in Desktop. Because my reports contain a lot of data and transformations, data loads in Desktop can take a long time. How can I make the resources available to Desktop identical to the resources in the cloud service, so that I can reduce data load times in Desktop (during development) and predict performance in the cloud service?
Perhaps a better question to ask is, should I even be doing this? That is, should I be trying to predict (in Desktop) a report's refresh performance in the cloud service (and / or load production-level data volumes into Desktop during development)?
Microsoft does not specify what hardware (CPU/memory) is used in the Power BI service. It is also a shared service, so more than one Power BI tenancy could be hosted on the same cluster. They do mention that you may suffer from noisy-neighbour issues, so if some other tenancy is hitting it hard, your performance may suffer.
I know from experience that the memory available is greater than 25GB, as queries that have not run on Premium P1 nodes have run OK in the service. With the dedicated nodes, you can use the admin reports to see what's going on in the background: query times, refresh times, and CPU/memory usage.
There are a few issues with trying to performance test Desktop vs. the service. For example, a SQL query in Desktop will run twice: first to check the structure and data, and then to get the data. This doesn't happen when it is deployed to the service, so in that example your load will be quicker.
If you are accessing on-premises data, it will be quicker in Desktop than in the service, as the service has to go via a gateway. Also, if you are connecting to an Azure SQL Database, the connections and bandwidth between the Azure services will be slightly quicker once you deploy to the service than a Desktop connection to an Azure service, as in that case the data has to travel outside the data centre to get to you.
So for imported datasets, you can look at the dataset refresh start and end times and work out how long the refresh took.
For a baseline test, generate 1 million rows of data; it doesn't have to be complex. Test the load time in Desktop a few times to get an average, deploy, and then try it in the service. Then keep adding 1 million rows to see if there is a linear relationship between the number of rows and the time taken.
However, it will not be a full like-for-like comparison, since the result depends on the type of data, its location, and the network speed, but it should give you a fair indication of any performance increase you may get in the service relative to your Desktop spec.
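The baseline test above can be checked with a small least-squares fit. The following is only a sketch: the row counts and timings are made-up placeholders (not measurements), and `fit_line` and `r_squared` are hypothetical helper names. You would replace the sample lists with the averaged load times you recorded in Desktop or in the service.

```python
from statistics import mean

def fit_line(rows, seconds):
    """Least-squares fit of seconds = intercept + slope * rows."""
    mx, my = mean(rows), mean(seconds)
    sxx = sum((x - mx) ** 2 for x in rows)
    sxy = sum((x - mx) * (y - my) for x, y in zip(rows, seconds))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

def r_squared(rows, seconds, slope, intercept):
    """Share of the variance in load time explained by the fitted line."""
    my = mean(seconds)
    ss_tot = sum((y - my) ** 2 for y in seconds)
    ss_res = sum((y - (intercept + slope * x)) ** 2
                 for x, y in zip(rows, seconds))
    return 1.0 - ss_res / ss_tot

# Placeholder data: averaged load time (seconds) per row count.
rows = [1_000_000, 2_000_000, 3_000_000, 4_000_000]
desktop_seconds = [12.0, 22.0, 32.0, 42.0]

slope, intercept = fit_line(rows, desktop_seconds)
fit_quality = r_squared(rows, desktop_seconds, slope, intercept)
print(f"seconds per million rows: {slope * 1_000_000:.2f}")
print(f"fixed overhead (seconds): {intercept:.2f}")
print(f"R^2: {fit_quality:.3f}")
```

An R^2 close to 1 suggests the relationship is roughly linear over the tested range. Running the same fit on the service timings lets you compare the two slopes and the two fixed overheads separately.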
I developed a tool at some point that uses Microsoft's PowerBI-Tools-For-Capacities under the hood.

How to understand an OLAP cube in S3 or Cassandra?

In this repository, its author mentions that we can stage OLAP cubes in Cassandra or S3:
Once the data is in Redshift, our chief goal is for the BI apps to be
able to connect to Redshift cluster and do some analysis. The BI apps
can either directly connect to the Redshift cluster or go through an
intermediate stage where data is in the form of aggregations
represented by OLAP cubes.
How is it possible? How would that work? Am I missing any essential concept? As I understand, OLAP cubes are a special data structure that exists in OLAP databases. Does he maybe mean specific pre-calculated combinations of dimensions and facts stored in an OLTP-oriented database, like Cassandra?
Key features of OLAP are:
pivoting
slicing
dicing
drilling
And Redshift can do this.
Its architecture is aimed at solving OLAP and BI tasks. See the Amazon Redshift Developer Guide:
Amazon Redshift is specifically designed for online analytic processing (OLAP) and business intelligence (BI) applications, which require complex queries against large datasets. Because it addresses very different requirements, the specialized data storage schema and query execution engine that Amazon Redshift uses are completely different from the PostgreSQL implementation. For example, where online transaction processing (OLTP) applications typically store data in rows, Amazon Redshift stores data in columns, using specialized data compression encodings for optimum memory usage and disk I/O. Some PostgreSQL features that are suited to smaller-scale OLTP processing, such as secondary indexes and efficient single-row data manipulation operations, have been omitted to improve performance.
But the line between the two terms is blurry.
As Diana Shealy said:
Stop Abusing OLTP as OLAP
There’s a lot of confusion in the market between OLTP and OLAP, and due to the high price of commercial OLAPs, startups and budget-constrained developers have gone on to abuse an OLTP database as an OLAP database. The abuse falls into two categories:
An often multi-shard MySQL database with application layer scripting to perform historical event data analysis. Although this setup is extremely common, it is one of the least productive ways to approach analytics. MySQL is not optimized in any way for reading large ranges of data and its support for analytic functions is weak. As there are multiple alternatives, avoid this “inexpensive” solution because you’ll be paying the price in other places eventually.
Using PostgreSQL as an OLAP layer. This is a more legitimate choice than above for starting an analytics platform because of Postgres’s solid analytic User Defined Functions (UDFs). Also, thanks to its c-store extension, PostgreSQL can be turned into a columnar database, making it an affordable alternative to commercial OLAPs.
Finally, if you are considering moving from OLTPs abused as OLAPs to “real” OLAPs like Redshift, I encourage you to learn how to use Redshift’s COPY Command so that you can start seeing your data inside Redshift.
As for your questions:
How is it possible?
It's possible due to Redshift architecture (column database) and analytical features such as:
Window functions
Data Warehouse System Architecture
Performance
Columnar Storage
Internal Architecture and System Operation
Workload Management
Aggregate functions
How would that work?
See System and Architecture Overview for a detailed explanation of the Amazon Redshift data warehouse system architecture.
(Some links are already mentioned before in this post)
Essential concept?
Am I missing any essential concept?
I'd suggest relying more on the technical details of a specific solution than on marketing terms. In the end, practical tasks are solved not by a product's name or marketing, but by its real functionality.
What's really important in the DB landscape is to consider two theorems:
CAP theorem
According to the CAP theorem (often drawn as a triangle), you can choose two of three properties for a distributed DB architecture:
* consistency
* availability
* partition tolerance
PIE theorem
Rick Houlihan of Amazon gave a talk on choosing a DB architecture. In addition to the CAP theorem, he also presented the PIE theorem:
The PIE theorem posits that you can choose two out of three desirable features in a data system:
Pattern Flexibility
Efficiency
Infinite Scale
And Redshift sits on the PI (Pattern Flexibility plus Infinite Scale) side of the PIE triangle.
Data structure
As I understand, OLAP cubes are a special data structure that exists in OLAP databases. Does he maybe mean specific pre-calculated combinations of dimensions and facts stored in an OLTP-oriented database, like Cassandra?
Both OLAP aggregated data structures and Redshift distribution styles serve one goal: making queries faster.
Columnar storage, data distribution, parallel query execution and other features are well suited to analytical tasks.
UPD
In the comments you asked whether Cassandra can work as an OLAP service.
Cassandra and S3 can be used as storage for pre-calculated aggregates over the dimensions.
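To make "pre-calculated aggregated data" concrete, here is a minimal sketch of building such aggregates as plain key/value pairs. It is an illustration only: the fact rows, the dimension names and the `build_cube` helper are hypothetical, and a Python dict stands in for the key/value store (a Cassandra table or S3 objects keyed by the dimension combination).

```python
from collections import defaultdict
from itertools import combinations

def build_cube(facts, dimensions, measure):
    """Pre-aggregate `measure` for every subset of `dimensions`.

    Each key is a tuple of (dimension, value) pairs, e.g.
    (("country", "DE"), ("year", 2020)); the empty tuple is the grand total.
    """
    cube = defaultdict(float)
    for row in facts:
        for size in range(len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                key = tuple((d, row[d]) for d in dims)
                cube[key] += row[measure]
    return dict(cube)

# Hypothetical fact rows, as they might be exported from the warehouse.
facts = [
    {"country": "DE", "year": 2020, "product": "A", "sales": 10.0},
    {"country": "DE", "year": 2021, "product": "A", "sales": 5.0},
    {"country": "US", "year": 2020, "product": "B", "sales": 7.0},
]

cube = build_cube(facts, ["country", "year", "product"], "sales")

# A "slice" is now a single key lookup instead of a scan over the facts.
print(cube[(("country", "DE"),)])
print(cube[(("country", "DE"), ("year", 2020))])
```

The trade-off is that the number of stored keys grows quickly with the number of dimensions and distinct values, which is why only the combinations a BI app actually needs are usually materialized.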

What's the difference between Google Cloud Spanner and Cloud SQL?

I am a novice in the GCP stack, so I am confused by the number of GCP technologies for storing data:
https://cloud.google.com/products/storage
Although Google Cloud Spanner is not mentioned in the article above, I know that it exists and is used for data storage: https://cloud.google.com/spanner
From my current view I don't see any significant difference between Cloud SQL (with Postgres under the hood) and Cloud Spanner. I found that it has a slightly different syntax, but that doesn't tell me when I should prefer this technology to spring cloud sql.
Could you please explain it?
P.S.
I consider spring cloud sql as a traditional database with automatic replication and horizontal scalability managed by google.
There is not a big difference between them in terms of what they do (storing data in tables). The difference is in how they handle data at small and large scale.
Cloud Spanner is used when you need to handle massive amounts of data with a high level of consistency and a high request rate (100,000+ reads/writes per second). Spanner gives much better scalability and better SLOs.
On the other hand, Spanner is also much more expensive than Cloud SQL.
If you just want to store some customer data cheaply and still don't want to deal with server configuration, Cloud SQL is the right choice.
If you are planning to build a big product, or if you want to be ready for a huge increase in users of your application (viral games/applications), Spanner is the right product.
You can find detailed information about Cloud Spanner in this official paper
The main difference between Cloud Spanner and Cloud SQL is the horizontal scalability + global availability of data over 10TB.
Spanner isn’t for generic SQL needs, Spanner is best used for massive-scale opportunities. 1000s of writes per second, globally. 10,000s - 100,000s of reads per second, globally.
That volume is extremely difficult to achieve with NORMAL SQL / MySQL without complex sharding of the database. Spanner deals with all this AND allows ACID updates (which is basically impossible with sharded databases). It accomplishes this with super-accurate clocks to manage conflicts.
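To illustrate why application-level sharding makes ACID updates hard, here is a small sketch. It does not show how Spanner works internally; it only models the manual sharding it replaces. The shard count, the key names and the helper functions are all hypothetical.

```python
import zlib

NUM_SHARDS = 4  # hypothetical number of separate database instances

def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Route a row key to a shard using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_shards

def shards_touched(keys, num_shards: int = NUM_SHARDS) -> set:
    """Return the set of shards an operation over `keys` would hit."""
    return {shard_for(k, num_shards) for k in keys}

def needs_cross_shard_transaction(keys, num_shards: int = NUM_SHARDS) -> bool:
    """True when one logical update spans more than one database."""
    return len(shards_touched(keys, num_shards)) > 1

# A transfer between two accounts updates two rows in one logical step.
transfer = ["account-17", "account-42"]
print(shards_touched(transfer))
print(needs_cross_shard_transaction(transfer))
```

Whenever the two rows land on different shards, no single database can commit both changes atomically, so the application has to coordinate the commit itself (for example with a two-phase protocol). That coordination is the work the answer describes Spanner as taking over.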
In short, Spanner is not for CRM databases, it is more for supermassive global data within an organisation. And since Spanner is a bit expensive (compared to cloud SQL), the project should be large enough to justify the additional cost of Spanner.
You can also follow this discussion on Reddit (a good one!): https://www.reddit.com/r/googlecloud/comments/93bxf6/cloud_spanner_vs_cloud_sql/e3cof2r/
Previous answers are correct, the main advantages of Spanner are scalability and availability. While you can scale with Cloud SQL, there is an upper bound to write throughput unless you shard -- which, depending on your use case, can be a major challenge. Dealing with sharded SQL was the big problem that Spanner solved within Google.
I would add to the previous answers that Cloud SQL provides managed instances of MySQL or PostgreSQL or SQL Server, with the corresponding support for SQL. If you're migrating from a MySQL database in a different location, not having to change your queries can be a huge plus.
Spanner has its own SQL dialect, although recently support for a subset of the PostgreSQL dialect was added.

Sagemaker Endpoint throttling exception

I have created an endpoint using Sagemaker, and designed my system so that it is called about 100 times simultaneously. This seemed to cause 'Model error' and take too much time. Do I need to create an endpoint for each event, and make one call per endpoint, instead?
You can check the CloudWatch logs to diagnose your model failure.
Real-time inference traffic scaling can be addressed via working on 3 independent dimensions:
1. hardware: choosing larger machines or more machines. For example, you can load test your model endpoint with bigger and bigger machines and see when hardware size gives you acceptable latency. The Autoscaling feature of SageMaker helps you address this automatically. If deploying a deep neural net, you can also consider using appropriate accelerators, e.g. GPU (EC2 P3, EC2 G4) or Amazon Elastic Inference Accelerator, to make each prediction much faster.
2. software: you have 2 levers to tune here:
* choosing a serving stack that is lean and fast. Different servers will handle load at different levels of performance. One common trick is to batch the load: for example, instead of hitting your server 100 times, can you hit it only once with a batch of 100 records? If clients cannot batch their requests, can you use micro-asynchrony so that you do the batching yourself after they have issued requests? You can usually configure such micro-batching in advanced deep learning servers such as TF Serving or MXNet Model Server (both can be used in SageMaker), but otherwise you can also do it yourself by having a queue (SQS) in front of your server.
* model compilation: optimizing the model graph and its runtime. This is a very smart concept that leverages the fact that when you know where you're going to deploy (e.g. NVIDIA, Intel, ARM, etc.), you have an insider edge and you can refine your model artifact and create a bespoke runtime application that is tailor-made for this specific target platform. This can reduce memory consumption and latency by double-digit percentages, and is an active area of ML research. In the SageMaker ecosystem, such a compilation task can be performed with SageMaker Neo, but the open source ecosystem is developing fast, with notably treelite (paper, doc) for decision tree compilation and TVM (paper, doc) for arbitrary neural net compilation. Both are dependencies of Neo, by the way.
3. science: some models are slower or heavier than others. If speed and concurrency are your priorities over accuracy, and if you have already exploited all possible tricks at levels (1) and (2) above, consider using fast-throughput models, e.g. linear models & logistic regression for structured data, MobileNet or SqueezeNet instead of large ResNets for classification (nice benchmark here), YOLOv3 instead of Faster R-CNN for detection (nice benchmark here), etc. But be aware that unlike levels (1) and (2), changing model science will alter accuracy.
As mentioned above, those 3 areas of improvements really are about real-time inference; if you can afford to pre-compute all possible model inputs, then the ultimate low-latency high-throughput solution is to pre-compute offline a variety of input-predictions pairs of interest and serve them on demand from a fast database or local read-only store.
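The batching idea from the software lever can be sketched as follows. This shows only the grouping step: `micro_batch` and `fake_predict_batch` are hypothetical names, the "model" is a stand-in function rather than a SageMaker endpoint call, and a real setup would collect the pending requests from a queue instead of a list.

```python
def micro_batch(requests, predict_batch, max_batch_size=100):
    """Send pending requests to the model in batches.

    `predict_batch` is any callable that takes a list of inputs and returns
    a list of predictions in the same order. Returns the predictions and
    the number of calls that were made to the model.
    """
    results = []
    calls = 0
    for start in range(0, len(requests), max_batch_size):
        batch = requests[start:start + max_batch_size]
        results.extend(predict_batch(batch))
        calls += 1
    return results, calls

def fake_predict_batch(batch):
    """Stand-in for one endpoint invocation with a batched payload."""
    return [2 * x for x in batch]

# 250 pending requests become 3 model calls instead of 250.
pending = list(range(250))
results, calls = micro_batch(pending, fake_predict_batch, max_batch_size=100)
print(calls, len(results))
```

Whether this helps depends on the model server accepting batched payloads and on how much extra latency clients can tolerate while a batch fills up.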

Why is BigQuery so slow on non-large data sizes?

We have found BigQuery to work great on data sets larger than 100M rows, where the 'initialization time' doesn't really come into effect (or is negligible compared to the rest of the query).
However, on anything under that, the performance is quite slow and poor, which makes it (1) ill-suited to working in an interactive BI tool; and (2) inferior to other products, such as Redshift or even ElasticSearch where the data size is under 100M rows. Actually, we had an engineer at our organization that was evaluating a technology for doing queries on data sizes between 1M and 100M rows for an analytics product that has about 1000 users, and his feedback was that he could not believe how slow BigQuery was.
Without a defense of the BigQuery product, I was wondering if there were any plans on improving:
The speed of BigQuery -- especially its initialization time -- on queries of non-massive data sets?
Will BigQuery ever be able to deliver sub-second response times on 'regular' queries (such as a simple aggregation group by) on datasets under a certain size?
The time is spent on metadata/initialization; the actual execution time is very small. We have work in progress that will address this, but some of the changes are complicated and will take a while.
You can imagine that in its infancy, BigQuery could have central systems for managing jobs, metadata, etc. in a manner that performed very well for all N0 entities using the service. Once you get to N1 entities, however, it may be necessary to rearchitect some things to make them have as little latency as possible. For notification about new features--which is also where we would announce API improvements related to start-up latency--keep an eye on our release notes, which you can also subscribe to as an RSS feed.
Exactly 4 years after this question, we have great news for BigQuery users! As stated in this BI Engine release note from 2021-02-25:
The BI Engine SQL interface expands BI Engine to integrate with other business intelligence (BI) tools such as Looker, Looqbox, Tableau, Power BI, and custom applications to accelerate data exploration and analysis. This page provides an overview of the BI Engine SQL interface, and the expanded capabilities that it brings to this preview version of BI Engine.
I believe this can solve the query latency issue mentioned in David542's question.