We have found BigQuery to work great on data sets larger than 100M rows, where the 'initialization time' doesn't really come into effect (or is negligible compared to the rest of the query).
However, on anything under that, the performance is quite slow and poor, which makes it (1) ill-suited to working in an interactive BI tool; and (2) inferior to other products, such as Redshift or even ElasticSearch where the data size is under 100M rows. Actually, we had an engineer at our organization that was evaluating a technology for doing queries on data sizes between 1M and 100M rows for an analytics product that has about 1000 users, and his feedback was that he could not believe how slow BigQuery was.
Without a defense of the BigQuery product, I was wondering if there were any plans on improving:
The speed of BigQuery -- especially its initialization time -- on queries of non-massive data sets?
Will BigQuery ever be able to deliver sub-second response times on 'regular' queries (such as a simple aggregation group by) on datasets under a certain size?
It's time spent on metadata/initiation, but actual execution time is very small. We have work in progress that will address this, but some of the changes are complicated and will take a while.
You can imagine that in its infancy, BigQuery could have central systems for managing jobs, metadata, etc. in a manner that performed very well for all N0 entities using the service. Once you get to N1 entities, however, it may be necessary to rearchitect some things to make them have as little latency as possible. For notification about new features--which is also where we would announce API improvements related to start-up latency--keep an eye on our release notes, which you can also subscribe to as an RSS feed.
After exacts 4 years since this question, we have amazing news to BigQuery users! As stated in this Bi Engine release note from 2021-02-25:
The BI Engine SQL interface expands BI Engine to integrate with other business intelligence (BI) tools such as Looker, Looqbox, Tableau, Power BI, and custom applications to accelerate data exploration and analysis. This page provides an overview of the BI Engine SQL interface, and the expanded capabilities that it brings to this preview version of BI Engine.
I believe this can solve the query latency issue mentioned by David542 question.
Related
I am new to BQ. I have a table with around 200 columns, when i wanted to get DDL of this table there is no ready-made option available. CATS is not always desirable.. some times we dont have a refernce table to create with CATS, some times we just wanted a simple DDL statement to recreate a table.
I wanted to edit a schema of bigquery with changes to mode.. previous mode is nullable now its required.. (already loaded columns has this column loaded with non-null values till now)
Looking at all these scenarios and the lengthy solution provided from Google documentation, and also no direct solution interms of SQL statements rather some API calls/UI/Scripts etc. I feel not impressed with Bigquery with many limitations. And the Web UI from Google Bigquery is so small that you need to scroll lot many times to see the query as a whole. and many other Web UI issues as you know.
Just wanted to know how you are all handling/coping up with BQ.
I would like to elaborate a little bit more to #Pentium10 and #guillaume blaquiere comments.
BigQuery is a serverless, highly scalable data warehouse that comes with a built-in query engine, which is capable of running SQL queries on terabytes of data in a matter of seconds, and petabytes in only minutes. You get this performance without having to manage any infrastructure.
BigQuery is based on Google's column based data processing technology called dremel and is able to run queries against up to 20 different data sources and 200GB of data concurrently. Prediction API allows users to create and train a model hosted within Google’s system. The API recognizes historical patterns to make predictions about patterns in new data.
BigQuery is unlike anything that has been used as a big data tool. Nothing seems to compare to the speed and the amount of data that can be fitted into BigQuery. Data views are possible and recommended with basic data visualization tools.
This product typically comes at the end of the Big Data pipeline. It is not a replacement for existing technologies but it complements them. Real-time streams representing sensor data, web server logs or social media graphs can be ingested into BigQuery to be queried in real time. After running the ETL jobs on traditional RDBMS, the resultant data set can be stored in BigQuery. Data can be ingested from the data sets stored in Google Cloud Storage, through direct file import or through streaming data
I recommend you to have a look for Google BigQuery: The Definitive Guide: Data Warehousing, Analytics, and Machine Learning at Scale book about BigQuery that includes walkthrough on how to use the service and a deep dive of how it works.
More than that, I found really interesting article for Data Engineers new to BigQuery, where you can find consideration regarding DDL and UI and best practices on Medium.
I hope you find the above pieces of information useful.
In this repository, its author mentions that we can stage OLAP cubes in Cassandra or S3:
Once the data is in Redshift, our chief goal is for the BI apps to be
able to connect to Redshift cluster and do some analysis. The BI apps
can either directly connect to the Redshift cluster or go through an
intermediate stage where data is in the form of aggregations
represented by OLAP cubes.
How is it possible? How would that work? Am I missing any essential concept? As I understand OLAP cubes are a special data structure that exists in OLAP databases. Does he maybe mean specific pre-calculated combinations of dimensions and facts stored in a OLTP-oriented database, like Cassandra?
Key features of OLAP are:
pivoting
slicing
dicing
drilling
And Redshift can do this.
It's architecture is aimed to solve OLAP and BI tasks. See amazon-redshift-developer-guide
Amazon Redshift is specifically designed for online analytic processing (OLAP) and business intelligence (BI) applications, which require complex queries against large datasets. Because it addresses very different requirements, the specialized data storage schema and query execution engine that Amazon Redshift uses are completely different from the PostgreSQL implementation. For example, where online transaction processing (OLTP) applications typically store data in rows, Amazon Redshift stores data in columns, using specialized data compression encodings for optimum memory usage and disk I/O. Some PostgreSQL features that are suited to smaller-scale OLTP processing, such as secondary indexes and efficient single-row data manipulation operations, have been omitted to improve performance.
But the line between terms is very smooth.
As Diana Shealy said:
Stop Abusing OLTP as OLAP
There’s a lot of confusion in the market between OLTP and OLAP, and due to the high price of commercial OLAPs, startups and budget-constrained developers have gone on to abuse an OLTP database as an OLAP database. The abuse falls into two categories:
An often multi-shard MySQL database with application layer scripting to perform historical event data analysis. Although this setup is extremely common, it is one of the least productive ways to approach analytics. MySQL is not optimized in any way for reading large ranges of data and its support for analytic functions is weak. As there are multiple alternatives, avoid this “inexpensive” solution because you’ll be paying the price in other places eventually.
Using PostgreSQL as an OLAP layer. This is a more legitimate choice than above for starting an analytics platform because of Postgres’s solid analytic User Defined Functions (UDFs). Also, thanks to its c-store extension, PostgreSQL can be turned into a columnar database, making it an affordable alternative to commercial OLAPs.
Finally, if you are considering moving from OLTPs abused as OLAPs to “real” OLAPs like Redshift, I encourage you to learn how to use Redshift’s COPY Command so that you can start seeing your data inside Redshift.
As for your questions:
How is it possible?
It's possible due to Redshift architecture (column database) and analytical features such as:
Window functions
Data Warehouse System Architecture
Performance
Columnar Storage
Internal Architecture and System Operation
Workload Management
Aggregate functions
How would that work?
See System and Architecture Overview for a detailed explanation of the Amazon Redshift data warehouse system architecture.
(Some links are already mentioned before in this post)
Essential concept?
Am I missing any essential concept?
I'd suggest more rely on technical details of specific solution instead of marketing terms. In the end, practical tasks are not solved by software naming or marketing, but with it's real functionality.
What's really important in DB landscape - is to consider two theorems:
CAP theorem
According to Iron triangle of CAP theorem, you can choose two points of three DB architecture components:
* consistency
* availability
* persistence
PIE theorem
Rick Houlihan of Amazon had a speech on choosing the DB archotecture. In addition to the CAP theorem, he also presented PIE theorem:
The PIE theorem posits that you can choose two out of three desirable features in a data system:
Pattern Flexibility
Efficiency
Infinite Scale
And Redshift is on PI dimension of the PIE triangle
Data structure
As I understand OLAP cubes are an special data structure that exists in OLAP databases. Does he maybe mean specific pre-calculated combinations of dimensions and facts stored in a OLTP-oriented database, like Cassandra?
Both OLAP aggregated data structures and Redshift distribution styles aimed one goal: make queries faster.
Column DB, distribution, parallel queries and other features are good for analytical tasks.
UPD
In comments you asked if Cassandra can work as OLAP service.
Cassandra and S3 can be used as a storage for pre-calculated aggregated data of dimensions.
Suppose you have 10 pretty big fact tables (each 50-100 GBs) that should be queried with Power BI. They doesn't fit into Azure Analysis Services RAM (reasonable price). So in order to use tabular model and AAS you have stay with the following schema:
(1) Power BI Desktop -> Azure Analysis Services -> [DirectQuery] -> SQL Database
But as far as I know from this article, AAS tabular model doesn't cache any aggregated results (means won't imply any additional performance optimizations). Moreover, AFAIK, Power BI (PowerPivot) already has embedded AAS.
As alternative, I can query SQL datasource directly from Power BI:
(2) Power BI Desktop -> [DirectQuery] -> SQL Database
Does the 1st schema (using AAS) provide any performance benefits over the 2nd schema (not using AAS)?
P.S. My question isn't about pros and cons of semantic layer, for that see this article. This question isn't the same as this question, because it's asking only about performance aspect of ASS DirectQuery.
The performance benefits will require testing depending on your work load, and other factors.
Caveat (This answer is based on my own and my colleagues experience and testing)
Service Standard:
From a service point of view, the main difference will be between Azure Analysis Services (AAS) and the Power BI Service (PBIS), is that AAS is a known set of hardware/performance, where as PBIS is a shared capacity, and can suffer from 'noisy neighbour' issues, if another customer is on the same cluster and using it heavily it will have an impact on you report performance.
Performance:
Essentially, PBI and AAS are doing the same thing, translating DAX to a SQL query and then returning the data. From my experience of building PBI and AAS in terms of performance there isn't much difference between the two. The main issue that tends to be the bottleneck is using a gateway to an on-prem SQL and the capacity of the SQL Server either on-prem or in the cloud. For example for better performance you can use Clustered Column Indexes to bring for example the fact tables into memory, and it is easier to increase/decrease the Azure SQL Database DTU's/capacity during business hours.
At the moment AAS doesn't have the Aggregated Mode that PBI does, which can reduce the number of queries being sent back and is a bit quicker, but also has the drawback of they have to be refreshed at some point.
I would recommend testing using for example DAX Studio to see what variability you may get in performance. My own testing has shown differences in the millisecond to 1 second range in favour of AAS.
However the benefits of the semantic layer is a powerful consideration
Connections:
AAS supports other connections such as Excel, SSMS, SSRS etc better than Power BI. Excel can connect to Power BI models with an additional plugin.
Maintainability:
Maintaining the data model across its life-cycle is a lot easier to do in Visual Studio/SSDT with Azure DevOps, Git etc. than it is in Power BI Desktop. With AAS you can also use Calculation Groups for Time Intelligence calculations, rather than multiple measure or workarounds for YTD, Parallel Period, MTD etc
If there was slightly better performance in a pure Power BI approach I would still use AAS due to the benefits of the none performance factors, it would have to show significantly improved performance before switching.
Hope that helps
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
I've been researching different approaches to streaming data to a real-time dashboard. One way that I have done in the past is using a star schema/dimension and fact tables. This would be an implementation of aggregate tables. For example, the dashboard would contain multiple charts, one being the total sales for the day, total sales by product, total sales by manufacturer, etc. etc.
But what if this needed to be real-time? What if the data needs to stream to these charts and do the analytical processing real-time?
I've been looking into solutions like Kinesis streams and Kafka, but I may be missing something obvious. For example, consider the following example. A company runs a website where they sell pies. The company has a backend dashboard where they keep track of all data and analytics related to sales, users, orders, etc.
Custom places order through website
The relational (mysql) database receives this new order
The charts and analytical data updates real-time on the backend, for example total sales for the day, or total sales for the year by user.
If the scenario is that this data needs to be streamed, what is the best approach to this? Aggregate tables seem like the obvious but it seems that would be periodic and not real-time. Kinesis/Kafka feels like it would fit somewhere in here. The other option would be something like Redshift but it's pretty pricey and still may not be the best way to address the issue and scale.
Here is an example of a chart that would need to be updated in real-time that could suffer by just doing place aggregate SQL queries when there are tons and tons of rows to parse.
In case of "always up-to-date" reports like this (sales, users, orders etc) that don't need live updates with near-zero-latency streaming processing might be overkill, and ROLAP-like approach seems to be more optimal in meaning of efforts/result.
You mentioned Redshift, and if you already ready to mirror your data for analytics purposes and only problem is a price you can consider another free open-source alternatives that could be used for handling OLAP (aggregate) queries in the real-time (like Yandex ClickHouse, or maybe MongoDb in some cases).
A lot of depends on the dataset size; unless you have really big data that need to be aggregated (hundreds of GB) you can try to keep using mysql and use some tricks:
use separate slave mysql server with high IOPS for analytics and replicate only tables needed to build your reports; possibly use another table engine, more suitable for analytical queries. Setup indexes specially for these queries, to avoid table full scan if you need to get numbers only for last weeks.
pre-calculate metrics for previous periods (with materialized view-like approach) and refresh them on schedule (say, daily), and then combine pre-calculated aggregates with on-the-fly aggregates only for last period to get actual report data without need to scan whole facts table each time.
use data visualization backend that can efficiently cache reports data in-memory to prevent SQL DB overload because of many similar queries (and if the same report or dashboard is displayed for 100 users SQL DB load will be the same as for 1). BTW, I develop solution like that (cannot adv it here as it is commercial product).
This is a typical trade-off for most the architects. Amazon Redshift offers exemplary read optimisations but AWS stack comes for a price. You may try using Cassandra, but it comes with its own set of challenges. When it comes to analytics, I never recommend going real time for the reasons elaborated below.
Doing analytics at real time is not desired, specially using MySQL
The solution for above comes by seggregating transactional and analytical infra. This involves cost but will make sure you don't have to spend time in housekeeping once you scale. MySQL is a row based RDBMS mostly used for storing transactional data. Being row based, it optimises writes i.e. the writes are almost real time and thus, it compromises on reads. When I say this, I refer to a typical analytics dataset running into millions of records/day. If your dataset is not that voluminous, you might still be able to render a graph showing transactional status. But since you're referring to Kafka, I assume the dataset is very large.
A real-time dashboard with visualisations gives a bad customer experience
Considering the above point, even if you go for a warehouse / a read optimised infra, you need to understand how the visualisations work. If 100 people access the dashboard at the same time, 100 connections will be made to the database, all fetching the same data, putting them in memory, applying calculations, parameters and filters defined in your dashboard, adjust the refined dataset in the visualisation and then render the dashboard. Till this time, the dashboard will simply freeze. A poorly constructed query, inefficient use of indexes etc will further make the matter worse.
The above problems will amplify more and more with the increase in your dataset. Good practices to achieve what you need would be:
To have almost realtime (delay of 1hr, 30 mins, 15 mins etc) rather than an absolute real time system. This will help you to create a flat file with the data already fetched in the memory. Your dashboard will simply read this data and will be extremely fast in terms of responses to filters etc. Also, multiple connections to databases will be avoided.
Have a data structure, database/warehouse optimised for reads.
For these types of operational analytics use-cases where the real-time nature of the data is critical, you're completely correct that most "traditional" methods can be quite clumsy, especially as your data size increases. A quick overview of your options:
Historical Approach (TLDR– Meh)
Up until about 5 years ago, the de facto way to do this looked something like
Set up a primary OLTP database that will handle the data in its raw form and have stricter guarantees on performance or ACID properties. Usually this is something SQL-esque, i.e. MySQL, PostgreSQL.
Set up a secondary OLAP database that is meant for serving offline (aka non user-facing) queries. This could also be a SQL-esque db but its schema would be drastically different because it stores the data in enriched form.
Set up some mechanism by which you can keep these 2 in sync. This pretty much boils down to either a) changing your application to always write to both databases and performing the necessary data enrichment or b) building a stand-alone application that reads from your OLTP database, performs the necessary transformations and enrichment and writes to your OLAP database
Plug your dashboard into your OLAP database which will have a schema and indexes optimized for the kind of queries you want.
Using your example about the pie store, the OLTP database would be used to store the purchases of all the pies and reference things like customer ids, billing information, delivery information, etc. In contrast, the OLAP database might just maintain a table with a schema
purchase_totals(day: Date, weekNumber: int, dayOfWeek: int, year: int, total: float)
While the weekNumber, dayOfWeek, and year and technically redundant they make your queries faster! With the proper indexes on these fields, your dashboard has turned into 5 simple (and fast!) aggregation queries with a group by and sum, and then the differences week-over-week or year-over-year can be computed on the client-side. As long as your dashboard refreshes every minute or so you have near-real-time data at your fingertips.
Current Approach (TLDR– Ok)
The recent trends in computing, database technologies, and data science/analytics have led to improvements to the above process, namely by replacing certain components of it. The changes include
Making the OLTP db, the OLAP db, or both a NoSQL database (Mongo usually being the most popular). The pro here is that you have a more flexible schema which won't break if something upstream changes (say, you start selling cakes in addition to pies).
Keeping the SQL db but shifting to cloud provider solution like AWS RDS or Google Cloud SQL. This fundamentally doesn't change anything about the architecture, but it does significantly reduce your operational burden.
Using hard-to-maintain ETL pipelines on top of streaming platforms like Kafka or AWS Kinesis to act as the middle layer between OLAP and OLTP.
Using dedicated tools for data cleaning and transformation as you plan out how to do your ETL
Using dedicated visualization tools on top of your OLAP db (think Tableau)
Using a pull-based approach for getting data out of your OLTP db or your application directly instead of waiting for it to eventually reach your OLAP db. This is helpful for online services because it actually gives you both the data you want AND confirmation that the service is alive and running well (because it just served your request for data). Systems like Prometheus are quite popular for this now.
I was wondering if anyone has used both AWS Redshift and Snowflake and use cases where one is better . I have used Redshift but recently someone suggested Snowflake as a good alternative . My use case is basically retail marketing data that will be used by handful of analysts who are not terribly SQL savvy and will most likely have reporting tool on top
Redshift is a good product, but it is hard to think of a use case where it is better than Snowflake. Here are some reasons why Snowflake is better:
The admin console is brilliant, Redshift has none.
Scale-up/down happens in seconds to minutes, Redshift takes minutes to hours.
The documentation for both products is good, but Snowflake is better laid
out and more accessible.
You need to know less "secret sauce" to make Snowflake work well. On Redshift you need to know and understand the performance impacts of things like distribution keys and sort keys, at a minimum.
The load processes for Snowflake are more elegant than Redshift. Redshift assumes that your data is in S3 already. Snowflake supports S3, but has extensions to JDBC, ODBC and dbAPI that really simplify and secure the ingestion process.
Snowflake has great support for in-database JSON, and is rapidly enhancing its XML. Redshift has a more complex approach to JSON, and recommends against it for all but smaller use cases, and does not support XML.
I can only think of two cases which Redshift wins hands-down. One is geographic availability, as Redshift is available in far more locations than Snowflake, which can make a difference in data transfer and statement submission times. The other is the ability to submit a batch of multiple statements. Snowflake can only accept one statement at a time, and that can slow down your batches if they comprise many statements, especially if you are on another continent to your server.
At Ajilius our developers use Redshift, Snowflake and Azure SQL Data Warehouse on a daily basis; and we have customers on all three platforms. Even with that choice, every developer prefers Snowflake as their go-to cloud DW.
I evaluated both Redshift(Redshfit spectrum with S3) and SnowFlake.
In my poc, snowFlake is way way better than Redshift. SnowFlake integrates well with Relational/NOSQL data. No upfront index or partition key required. It works amazing without worrying about what way to access the day.
Redshift is very limited and no json support. Its hard to understand the partition. You have to do lot of work to get something done. No json support. You can use redshift specturm as a bandaid to access S3. Good luck with partioning upfront. Once you created partition in S3 bucket, you are done with that and no way to change until unless you redo process all data again to new structue. You will end up sending time to fix these issues instead of working on fixing real business problems.
Its like comparing Smartphone vs Morse code mechine. Redshift is like morse code kind of implementation and its not for mordern development
We recently switched from Redshift to Snowflake for the following reasons:
Real-time data syncing
Handling of concurrent queries
Minimizing of database administration
Providing different amounts of computing power to different Looker users
A more in-depth writeup can be found on our data blog.
I evaluated Redshift and Snowflake, and a little bit of Athena and Spectrum as well. The latter two were non-starters in cases where we had big joins, as they would run out of memory. For Redshift, I could actually get a better price to performance ratio for a couple reasons:
allows me to choose a distribution key which is huge for co-located joins
allows for extreme discounts on three year reserved pricing, so much so that you can really upsize your compute at a reasonable cost
I could get better performance in most cases with Redshift, but it requires good MPP knowledge to setup the physical schema properly. The cost of expertise and complexity offsets some of the product cost.
Redshift stores JSON in a VARCHAR column. That can cause problems (OOM) when querying a subset of JSON elements across large tables, where the VARCHAR column is sized too big. In our case we had to define the VARCHAR as extremely large to accommodate a few records that had very large JSON documents.
Snowflake functionality is amazing, including:
ability to clone objects
deep functionality in handling JSON data
snowpipe for low maintenance loading, auto scaling loads, trickle updates
streams & tasks for home grown ETL
ability to scale storage and compute separately
ability to scale compute within a minute, requiring no data migration
and many more
One thing that I would caution about Snowflake is that one might be tempted to hire less skilled developers/DBAs to run the system. Performance in a bad schema design can be worked around using a huge compute cluster, but that may not be the best bang for the buck. Regardless, the functionality in Snowflake is amazing.