Difference between dashDB for Analytics and dashDB for Transactions? - dashdb

DashDB has plans that are listed as "transactional" or "for transactions". Under the hood, what is the difference between dashDB for Analytics and dashDB for Transactions? Are there limitations to either one?

dashDB for Analytics: Tuned for analytics (large, complex queries, sometimes called OLAP). In particular, tables are column-organized by default. (If you run a CREATE TABLE statement, by default it is ORGANIZE BY COLUMN.)
Limitation: Some things, like CLOB data types, do not work with column-organized tables. If you need a regular row-organized table in dashDB for Analytics, you have to ask for it explicitly: CREATE TABLE ... ORGANIZE BY ROW (see the sketch below).
dashDB for Transactions: Also referred to as "transactional plans" or "dashDB TX", it is tuned for transactions (OLTP) and typical web workloads. Tables are row-organized (the typical SQL default).
In sum, if you're doing heavy analysis work, generating big reports, or running R scripts, choose dashDB for Analytics. For most general-purpose workloads, dashDB for Transactions is a better fit.
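A minimal DDL sketch of the two organizations (the table and column names are hypothetical; in dashDB for Analytics the first form is what you get by default):

-- Column-organized: the default in dashDB for Analytics, good for scans and aggregations.
CREATE TABLE sales_facts (
    sale_id  INTEGER NOT NULL,
    amount   DECIMAL(12,2),
    sold_on  DATE
) ORGANIZE BY COLUMN;

-- Row-organized: must be requested explicitly in dashDB for Analytics,
-- e.g. for tables that need CLOB columns.
CREATE TABLE sale_notes (
    sale_id   INTEGER NOT NULL,
    note_text CLOB(1M)
) ORGANIZE BY ROW;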

Related

Schema design for Google BigTable

In my project, I'm using Google BigQuery, which holds lots of data.
The BigQuery columns are:
account_id, session_id, transaction_id, username, event, timestamp.
In my dashboard, I'm fetching the entire data set based on timestamp (the last 30 days).
Since I have very large data, performance is pretty slow (13 seconds to fetch the last 30 days of data).
Lately, I have been looking at Google Bigtable and I saw it has an option to get data based on time.
In my tests, the performance of Bigtable was slower than BigQuery.
Is there a suggested schema that can improve performance with Bigtable?
This is an example of my schema in Bigtable:
// timestamp_micros, startCounter, and timestamp are defined elsewhere in the job.
const row = {
  key: `transactions#${timestamp_micros}`,
  data: {
    identifiers: {
      session_id: `session_id-${startCounter}`,
      account_id: `account-${startCounter}`,
      device_id: `device-${startCounter}`,
      transaction_id: `transaction_id-${startCounter}`,
      runtime_id: 'AQW+2Xx5AQAAstvxskK0c8NTk+vP5eBM',
      page_id: `page_id-${startCounter}`,
      start_time: timestamp,
    },
  },
};
Can anyone suggest a better schema that will help me fetch the data (by timestamp range) with the best performance?
A good schema results in excellent performance and scalability, while a bad schema can lead to a poorly performing system. However, no single schema design is the best fit for all use cases, so the answer is somewhat opinion-based and will vary from person to person. The patterns described in the Bigtable schema-design documentation provide a starting point; your unique dataset and the queries you plan to run are the most important things to consider as you design a schema for your time-series data.
As you've discovered from our docs, the row key format is the biggest decision you make when using Bigtable, as it determines which access patterns can be performed efficiently. Having a row key like transaction_id#reverse_timestamp gets your data sorted with the latest timestamp first. This can also avoid hotspotting, which is one of the big reasons for slow query results.
However, you're also coming from a SQL architecture, which isn't always a good fit for Bigtable's schema/query model. So here are some questions to get you started:
Are you planning to perform lots of ad hoc queries like "SELECT A FROM Bigtable WHERE B=x"? If so, strongly prefer BigQuery; Bigtable can't support this query without performing a full table scan (hence it is slower than BigQuery).
Will you require multi-row OLTP transactions? Again, use BigQuery, as Bigtable only supports transactions within a single row.
Are you streaming in new events at high QPS? Bigtable is much better for these sorts of high-volume updates.
Do you want to perform any sort of large-scale complex transformations on the data? Again, Bigtable is likely better here, as you can stream data out and back in faster.
You can also combine the two services if you need some combination of these features. For example, say you're receiving high-volume updates all the time, but want to be able to perform complex ad hoc queries. If you're alright working with a slightly delayed version of the data, it could make sense to write the updates to Bigtable, then periodically scan the table using Dataflow and export a post-processed version of the latest events into BigQuery. GCP also allows BigQuery to serve queries directly from Bigtable in some regions: https://cloud.google.com/bigquery/external-data-bigtable
My personal choice for your use case is BigQuery. You can leverage partition pruning in BigQuery, where BigQuery scans only the partitions that match the filter and skips the rest. Partitioning not only makes it easier to manage and query your data; by dividing a large table into smaller partitions, you can improve query performance and control costs by reducing the number of bytes read by a query. You can use time-unit column partitioning or ingestion-time partitioning. When you create a table partitioned by ingestion time, BigQuery automatically assigns rows to partitions based on the time when BigQuery ingests the data. You can choose hourly, daily, monthly, or yearly granularity for the partitions.
So your query for fetching the data from the last 30 days should look something like this in BigQuery (when ingestion-time partitioning is used):
SELECT
  column
FROM
  dataset.table
WHERE
  -- The filter on _PARTITIONTIME limits the scan to the last 30 days of partitions.
  _PARTITIONTIME BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
                     AND CURRENT_TIMESTAMP()
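For completeness, a minimal sketch of the corresponding table DDL (dataset, table, and column names are hypothetical, loosely following the columns from the question):

-- Time-unit column partitioning: one partition per day of event_timestamp.
CREATE TABLE dataset.transactions (
  account_id      STRING,
  session_id      STRING,
  transaction_id  STRING,
  username        STRING,
  event           STRING,
  event_timestamp TIMESTAMP
)
PARTITION BY DATE(event_timestamp);
-- For ingestion-time partitioning, use PARTITION BY _PARTITIONDATE instead.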

Big Query vs Cloud SQL for audit data

Which database is better for transactional data, Cloud SQL or Google BigQuery?
I have a requirement wherein, from multiple jobs, I need to load data into a single table. Which database will be better for this, Google BigQuery or Cloud SQL?
I know that in terms of cost effectiveness Cloud SQL is the better choice, but are there any other pointers apart from this?
Cloud SQL is a managed database for transactional loads (OLTP). It can be tweaked to work with OLAP (analytical) workloads, but it is intended to be a transactional database.
BigQuery is for analytical data (OLAP), that is, data that won't change. Think of it as data "at rest" that's not going to be changed.
If some of your transactions are not finalized (there are in-flight transactions, or your end-to-end process needs several steps), you need a transactional database, like the ones from Cloud SQL.
If your data won't change, use BigQuery.
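To make the distinction concrete, here is an illustrative sketch (the audit_events table and its columns are hypothetical): the in-flight, multi-step pattern is what Cloud SQL is built for, while the append-once pattern is the natural fit for BigQuery.

-- OLTP style (Cloud SQL, e.g. PostgreSQL): a row is written and then updated
-- inside a transaction while the job is still in flight.
BEGIN;
INSERT INTO audit_events (job_id, status, created_at)
VALUES (42, 'STARTED', NOW());
UPDATE audit_events SET status = 'FINISHED' WHERE job_id = 42;
COMMIT;

-- OLAP style (BigQuery): each job appends its finalized rows once and nothing
-- is updated afterwards.
INSERT INTO dataset.audit_events (job_id, status, created_at)
VALUES (42, 'FINISHED', CURRENT_TIMESTAMP());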

Use AWS Athena With Dynamic Fields / Schemaless

We want to use AWS Athena for analytics and segmentation. Our problem is that our data is schemaless; rows differ from one another, with only some columns in common.
Is it possible to create table without defining all the columns?
When we query, we know the type (string/int) of each column, so if there is a way to define it at query time that would be great.
We can structure the data in any way needed to support schemaless use, and in any format: CSV / JSON.
Is Athena an option for schemaless uses?
There are many ways to use Athena with schemaless data, but you need to give specific examples of the scenarios you want to support efficiently. In Athena you pay based on the data that you scan, so optimizing your data to minimize the amount scanned is critical to making it a useful tool at scale.
The simplest way to get started, as you are learning the tool and the types of queries you can run on your data, is to define a table with a single column ("line"), and then parse out the data you want using string functions, or JSON functions if the lines are in JSON format.
You will get good response times if you have multiple files, but it will be expensive, as you need to scan all your data for every query. I suggest that you start with these queries as a good way to define your requirements. As usage grows, start optimizing the more popular (and expensive) use cases with CTAS (Create Table As Select) commands that generate Parquet versions of the original raw data.
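A minimal sketch of the single-column approach, assuming newline-delimited JSON files under a hypothetical s3://my-bucket/raw/ prefix (the field names are made up):

-- Every line of every file lands in one string column.
CREATE EXTERNAL TABLE raw_events (line string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-bucket/raw/';

-- Parse fields and cast types at query time.
SELECT
  json_extract_scalar(line, '$.account_id') AS account_id,
  CAST(json_extract_scalar(line, '$.amount') AS double) AS amount
FROM raw_events
WHERE json_extract_scalar(line, '$.event') = 'purchase';

-- Later, CTAS can materialize a typed Parquet copy for the popular use cases.
CREATE TABLE events_parquet
WITH (format = 'PARQUET') AS
SELECT json_extract_scalar(line, '$.account_id') AS account_id,
       json_extract_scalar(line, '$.event')      AS event
FROM raw_events;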
You are welcome to read my blog post, which describes the strategy and tactics of building a cloud environment using Athena and the other AWS tools around it.

What does "interactive query" or "interactive analytics" mean in Presto SQL?

The Presto website (and other docs) talk about "interactive queries" on Presto. What is an "interactive query"? From the Presto Website: "Facebook uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse."
An interactive query system is basically a user interface that translates the input from the user into SQL queries. These are then sent to Presto, which processes the queries and gets the data and sends it back to the user interface.
The UI then renders the output, which is typically NOT just a simple table of numbers and text, but rather a complex chart, a diagram or some other powerful visualization.
The user expects to be able to, for example, change one criterion and get the updated chart or visualization in near real time, just as they would from any typical application, even if producing that analysis involves processing LOTS of data.
Presto can do that because it can query massive distributed storage systems like HDFS and many cloud storage systems, as well as RDBMSs and so on. And it can be set up with a huge cluster of workers that query the sources in parallel, and can therefore process massive amounts of data for analysis while still being fast enough to meet user expectations.
A typical application to use for the visualization is Apache Superset. You just hook up Presto to it via the JDBC driver. Presto has to be configured to point at the underlying data sources and you are ready to go.
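To make "interactive" concrete, when a user narrows a date filter in the UI, the tool typically just re-issues a small aggregate query against Presto, along the lines of the sketch below (hive.web.events and its columns are made up for illustration):

-- The kind of query a dashboard might generate each time a filter changes.
SELECT event_date,
       count(*) AS events
FROM hive.web.events
WHERE event_date BETWEEN DATE '2024-01-01' AND DATE '2024-01-31'
GROUP BY event_date
ORDER BY event_date;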

Columnar database queries in Amazon Redshift

I'm learning Amazon Redshift. I heard that it is a very powerful cloud data store and works very fast on data where aggregate operations are required, because it stores data column-wise.
I am not able to find any example queries. Could someone share some examples of aggregate queries running on Amazon Redshift? Are they different from normal relational database queries?
You are correct -- Amazon Redshift is a columnar database. This means that data is stored on disk per column, making operations on a column very fast. For example, adding the Sales column for a particular value in the Country column only requires accessing two columns rather than all columns in a table.
Other benefits are that data in Redshift is compressed (which works well with the columnar concept, because each column uses its own compression method based on the data stored) and the fact that it is a clustered database, so compute and storage can be scaled by adding additional nodes.
Amazon Redshift presents itself as a PostgreSQL database, so you just use industry-standard SQL to query data. No changes to queries are required.
However, you can optimize Redshift by wisely choosing a Distribution Key for each table, which determines how data is distributed amongst nodes, and by carefully selecting the Sort Key, which determines how data is stored on each node. Put simply, data should be distributed by how you JOIN tables and should be sorted by what you use in WHERE clauses.
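As an illustrative sketch (the table and key choices are hypothetical, not a recommendation for your data):

-- Distribute rows by the column used in joins; sort them by the column used in WHERE filters.
CREATE TABLE sales (
  sale_id     BIGINT,
  customer_id BIGINT,
  country     VARCHAR(2),
  amount      DECIMAL(12,2),
  sold_at     TIMESTAMP
)
DISTKEY (customer_id)
SORTKEY (sold_at);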
As for sample queries... it totally depends upon your data! Queries look exactly the same as normal SQL.
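For example, a typical aggregate query against the hypothetical sales table above is plain SQL:

-- Total sales per country over the last 30 days; only the referenced columns are read.
SELECT country,
       SUM(amount) AS total_sales
FROM sales
WHERE sold_at >= DATEADD(day, -30, GETDATE())
GROUP BY country
ORDER BY total_sales DESC;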