Should I use Data Warehouse or database or something else? - amazon-web-services

On our current project we have a web app with an analytics module. The users select filters, and based on those filters a table or graph is shown. We want the module to be responsive, so that when the users select filters the data comes back in a matter of seconds.
The filters query a large table of ~1,000,000,000 rows and 20 columns (it should grow roughly 2x per year in rows for the next few years). 18 of the 20 columns are filterable, and the workload is mostly SELECT + WHERE queries.
We are not sure whether we should use a data warehouse or a classical database.
Our current research suggests deciding between ClickHouse, DynamoDB, Snowflake, BigQuery, or Redshift. Has anyone had a similar use case, and which database solution would you recommend?

Since you are using the database for analytics purposes, an OLAP database (e.g., Redshift) is recommended.
An OLAP database is designed to process large datasets quickly in order to answer analytical questions about the data.
You can compare the pricing here
https://medium.com/2359media/redshift-vs-bigquery-vs-snowflake-a-comparison-of-the-most-popular-data-warehouse-for-data-driven-cb1c10ac8555
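As a rough illustration of why a columnar warehouse fits this workload, here is a hedged sketch of what the table and a typical filter query could look like in Redshift. All table and column names are hypothetical, and the sort key would need to be chosen from your own most common filters:

-- Hypothetical table; a compound sort key on frequently filtered columns
-- lets Redshift skip blocks whose value ranges fall outside the WHERE clause.
CREATE TABLE events (
    event_id    BIGINT,
    event_date  DATE,
    customer_id INTEGER,
    region      VARCHAR(32),
    amount      DECIMAL(12,2)
)
DISTSTYLE EVEN
COMPOUND SORTKEY (event_date, customer_id);

-- A typical analytics-module query: SELECT + WHERE over a few of the
-- filterable columns, aggregated for a chart.
SELECT region, SUM(amount) AS total_amount
FROM events
WHERE event_date BETWEEN '2023-01-01' AND '2023-03-31'
  AND customer_id = 42
GROUP BY region;

Because storage is columnar, only the referenced columns are read, which is what keeps this kind of query interactive even at a billion rows.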

Related

What is the approach to merge data from multiple databases (same schema) using Power BI?

I have 3 OLTP databases, all using the same database schema. Each db represents one department.
I am exploring Power BI as a solution for reporting at the company level, so all departments combined.
What is the approach to combine data from multiple dbs into a data warehouse? For example - do I need SSIS to combine the 3 dbs into 1 data warehouse?
Another option could be to have one shared dataset per database and have the final report connect to and combine multiple live datasets. Or is there another way to do this with Power BI?
Any reference links if someone has done this?
Or is there another way with Power BI
Yes. Simply create a single import model and load data from all three databases into it. For each table in your Power BI model you would have three Power Queries set to not load into the model, and you would append them in a query that is used to load your model. See, for example: https://learn.microsoft.com/en-us/power-query/append-queries
Best practice would be to:
Extract the data into a single database (DWH or reporting schema)
Build the necessary items there for your data model, be it reporting schema, or star/snowflake schemas
Connect Power BI to that schema.
Combining datasets is going to be tricky, as you may have the same measures in each of the datasets. Combining in the database, with added columns to indicate the department, is the best option in terms of supporting updating/adding/removing items; for example, if the schema changes in the databases you change it in one place, not in three datasets. The database/SSIS toolset is also better suited to the heavy lifting of moving the data into one location.
You would use SSIS to extract the data if it is on-prem, or Azure Data Factory for Azure databases. Extract to a staging schema, then convert/transform the data into its final form under a new schema that defines what it is: facts/dimensions, or other schema names such as reporting, depending on the data model you wish to build. Most of this is covered by the standard ETL pattern of OLTP to an OLAP database.
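As a minimal sketch of the "combine in the database with a department column" approach (all database, schema, and table names here are hypothetical), the load step for one table could be a simple UNION ALL across the three department databases:

-- Load one warehouse table from the three identical department databases,
-- tagging each row with the department it came from.
INSERT INTO reporting.Sales (Department, OrderId, OrderDate, Amount)
SELECT 'DeptA', OrderId, OrderDate, Amount FROM DeptA_DB.dbo.Sales
UNION ALL
SELECT 'DeptB', OrderId, OrderDate, Amount FROM DeptB_DB.dbo.Sales
UNION ALL
SELECT 'DeptC', OrderId, OrderDate, Amount FROM DeptC_DB.dbo.Sales;

In practice SSIS or Data Factory would run a statement like this (or the equivalent data flow) per table into the staging schema, and Power BI would connect only to the resulting warehouse tables.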

Schema design for Google BigTable

In my project, I'm using Google BigQuery, which holds lots of data.
The BigQuery columns are:
account_id, session_id, transaction_id, username, event, timestamp.
In my dashboard, I'm fetching all of the data based on timestamp (the last 30 days).
Since the data is very large, performance is pretty slow (13 seconds to fetch the last 30 days of data).
Lately I've been looking at Google Bigtable, and I saw it has an option to get data based on time.
In my tests, Bigtable's performance was slower than BigQuery's.
Is there a suggested schema that can improve performance with Bigtable?
This is an example of my schema in Bigtable:
// Row to insert into Bigtable; timestamp_micros, timestamp and
// startCounter are defined elsewhere in the loader.
const row = {
  // Row key: every row shares the "transactions" prefix, followed by the
  // event timestamp in microseconds.
  key: `transactions#${timestamp_micros}`,
  data: {
    identifiers: {
      session_id: `session_id-${startCounter}`,
      account_id: `account-${startCounter}`,
      device_id: `device-${startCounter}`,
      transaction_id: `transaction_id-${startCounter}`,
      runtime_id: 'AQW+2Xx5AQAAstvxskK0c8NTk+vP5eBM',
      page_id: `page_id-${startCounter}`,
      start_time: timestamp,
    },
  },
};
Can anyone suggest a better schema that will help me fetch the data (for a timestamp range) with the best performance?
A good schema results in excellent performance and scalability, while a bad schema can lead to a poorly performing system. However, no single schema design is the best fit for all use cases, so the answer will vary from case to case. The patterns described in the Bigtable schema-design documentation provide a starting point; your unique dataset and the queries you plan to run are the most important things to consider as you design a schema for your time-series data.
As you've discovered from the docs, the row key format is the biggest decision you make when using Bigtable, as it determines which access patterns can be performed efficiently. A row key such as transaction_id#reverse_timestamp gets your data sorted from the latest timestamp and could avoid hotspotting issues, which is one of the big reasons for slow query results.
However, you're also coming from a SQL architecture, which isn't always a good fit for Bigtable's schema/query model. So here are some questions to get you started:
Are you planning to perform lots of ad hoc queries like "SELECT A FROM Bigtable WHERE B=x"? If so, strongly prefer BigQuery. Bigtable can't support this query without performing a full table scan (hence it is slower than BigQuery).
Will you require multi-row OLTP transactions? Again, use BigQuery, as Bigtable only supports transactions within a single row.
Are you streaming in new events at high QPS? Bigtable is much better for these sorts of high-volume updates.
Do you want to perform any sort of large-scale complex transformations on the data? Again, Bigtable is likely better here, as you can stream data out and back in faster.
You can also combine the two services if you need some combination of these features. For example, say you're receiving high-volume updates all the time but want to be able to perform complex ad hoc queries. If you're alright working with a slightly delayed version of the data, it could make sense to write the updates to Bigtable, then periodically scan the table using Dataflow and export a post-processed version of the latest events into BigQuery. GCP also allows BigQuery to serve queries directly from Bigtable in some regions: https://cloud.google.com/bigquery/external-data-bigtable
My personal choice for your use case is BigQuery. You can leverage partition pruning, where BigQuery scans only the partitions that match the filter and skips the rest. Partitioning not only makes it easier to manage and query your data; by dividing a large table into smaller partitions, you can improve query performance and control costs by reducing the number of bytes read by a query. You can use time-unit column partitioning or ingestion-time partitioning. When you create a table partitioned by ingestion time, BigQuery automatically assigns rows to partitions based on the time when BigQuery ingests the data, and you can choose hourly, daily, monthly, or yearly granularity for the partitions.
So your query for fetching the data based on timestamp (last 30 days) should look something like this in BigQuery (when using ingestion-time partitioning):
SELECT
  column
FROM
  dataset.table
WHERE
  _PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
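If you can use time-unit column partitioning instead, a hedged sketch of the DDL and the matching dashboard query might look like this (the dataset and table names are assumptions based on the columns you listed):

-- Partition on the event timestamp so a 30-day filter scans roughly 30
-- daily partitions instead of the whole table.
CREATE TABLE dataset.transactions (
  account_id     STRING,
  session_id     STRING,
  transaction_id STRING,
  username       STRING,
  event          STRING,
  timestamp      TIMESTAMP
)
PARTITION BY DATE(timestamp);

-- With column partitioning the filter goes on the partitioning column
-- itself rather than on _PARTITIONTIME.
SELECT account_id, event, timestamp
FROM dataset.transactions
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY);

Adding CLUSTER BY account_id (or whichever column you filter on most after time) can further reduce the bytes scanned within each partition.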

PowerBi - Connection Type (DIRECT QUERY or IMPORT DATA) Question

I am working on a Power BI project and I need some advice on the best way to approach it. I am tasked with creating a dashboard for employee metrics pulled from an on-site SQL Server database. The managers here are going to have access to the Power BI cloud service, so I will end up uploading this to the cloud. There are 10 or so metrics that need to be shown on the dashboard. We have 5000+ employees. My first thought was to create a table, dump all the metrics into it, and set the Power BI report to import the data, but that seems excessive and a waste of space to upload all that data to the cloud, since not every manager needs access to every employee; they may want to see only one or two employees' metrics on the dashboard.
My second thought is to (if this is possible) create a stored procedure that takes an employee id and outputs a dataset for Power BI to build a visual from. On the dashboard, there would be a list of employees, and when a manager selects one, Power BI would call the stored procedure with the employee id and turn the returned dataset into a visual based on my measurements. I guess I would set the Power BI report connection type to DIRECT QUERY?
Here are my questions:
Is this possible? That is, can I do what I am describing in my second plan? Is this how DIRECT QUERY works?
If so, how does DIRECT QUERY work with the PowerBi cloud?
What is the setup like? Do I just install the Power BI data gateway and configure it like IMPORT DATA, and Power BI does the rest?
A couple of queries:
What is the frequency of data updates?
If it is a batch job, it is ideally preferable to import the data from the source into the Power BI model and report on the imported data, because:
a) performance would be quicker;
b) there would be no to-and-fro of data between the on-prem database and the cloud;
c) the source would not be hit constantly.
So is the ask to have row-level security (RLS), wherein the managers should see only the employees under them?
If so, it is much easier to implement RLS in an imported model than with DirectQuery.
Also, you won't be able to pass parameters to stored procedures, and you can't execute them in DirectQuery mode. You can, however, create table-valued functions, which give you the ability to use table variables and perform other, more complex operations in DirectQuery mode.
You can refer to this for additional details:
https://community.powerbi.com/t5/Desktop/Can-i-call-Stored-Procedure-with-Direct-Query/m-p/267141#:~:text=%40Pallavi%20you%20won't%20be,nature%20in%20Direct%20Query%20mode.
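As a hedged illustration of that point (the table, column, and function names here are hypothetical), an inline table-valued function for the employee-metrics scenario from the question could look like this:

-- Inline table-valued function that returns the metric rows for a single
-- employee; unlike a stored procedure, it can be used in DirectQuery scenarios.
CREATE FUNCTION dbo.fn_EmployeeMetrics (@EmployeeId INT)
RETURNS TABLE
AS
RETURN
(
    SELECT m.EmployeeId,
           m.MetricName,
           m.MetricValue,
           m.MetricDate
    FROM dbo.EmployeeMetrics AS m
    WHERE m.EmployeeId = @EmployeeId
);

That said, the simpler pattern is usually to point DirectQuery at a table or view of all employee metrics and let an employee slicer on the report do the filtering; DirectQuery pushes that filter down to SQL Server as a WHERE clause, so only the selected employee's rows are retrieved.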

Using fake timestamps to create partitions on Google BigQuery

Google BigQuery (BQ) allows you to create a partition using timestamp or date types only.
99% of my data has a very clear selector, idClient. I've created views for my customers with a predicate like idClient = code, so privacy is guaranteed.
The problem with this strategy is that some customers have 5M rows and others 200K, and since BQ does not have indexes, every customer's queries end up processing the other customers' data as well (and the costs are rising).
I intend to create a timestamp field where each customer gets a distinct timestamp value, repeated for every insert into that customer's sensitive tables, so that I can filter by a fixed timestamp the way I would by a standard ID.
Does this make any sense? If BQ were an indexed database I'd be concerned about skewed data, but since it always does a full table scan, I think I'd have only benefits and no downsides.
The solution to your problem is to add a clustering field to your table, which is roughly equivalent to an index in other databases.
The BigQuery documentation on clustered tables covers the basics of how to use a clustering field.
Clustering can improve the performance of certain types of queries, such as queries that use filter clauses and queries that aggregate data. When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns.
Note: when using a clustering field, BigQuery's dryRun does not show the cost improvement, which can only be seen after the query has run.
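A hedged sketch of what that could look like for the idClient case (the dataset, table, and column names other than idClient are assumptions):

-- Cluster on idClient (and partition on a real timestamp column if you have
-- one) so that per-customer filters skip most of the table's blocks.
CREATE TABLE dataset.customer_events (
  idClient   INT64,
  event_name STRING,
  created_at TIMESTAMP
)
PARTITION BY DATE(created_at)
CLUSTER BY idClient;

-- The per-customer views keep the same predicate; clustering reduces the
-- bytes actually scanned when this filter is applied.
SELECT event_name, created_at
FROM dataset.customer_events
WHERE idClient = 12345;

This avoids the fake-timestamp workaround entirely, since a clustering column can be an INT64, STRING, DATE, or similar top-level column rather than only a timestamp.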

Best strategy for building Redshift Data Warehouse from multiple DBs

I need some guidance on our strategy for loading data into a Redshift Data Warehouse for analytics. We have ~40 SQL databases, each represents one customer and each database is identical. I have a SQL database with the same table structure as the 40 but each table has an additional column called "customer" that will capture where that record came from. We do some additional ETL processing with the records as they come in.
In total we have about 50 GB of data across all 40 DBs. Looking into the recommended processes for updating/inserting data on AWS's site, they recommend creating a staging (scratch) table and then merging the data into the target. I could do this, but I could also just drop all the data from a table and re-load it, since I am reading from the source every time. What is the recommended way to handle this?
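For reference, a minimal sketch of the staging-table merge pattern mentioned above (all schema and table names are hypothetical):

-- Load the fresh extract into a staging table (e.g. via COPY), then merge
-- it into the target inside one transaction.
BEGIN TRANSACTION;

-- Remove the rows that are about to be replaced.
DELETE FROM analytics.orders
USING analytics.orders_staging s
WHERE analytics.orders.customer = s.customer
  AND analytics.orders.order_id = s.order_id;

-- Insert the freshly extracted rows.
INSERT INTO analytics.orders
SELECT * FROM analytics.orders_staging;

END TRANSACTION;

TRUNCATE analytics.orders_staging;

At ~50 GB total, the drop-and-reload approach is also viable; the merge pattern mainly pays off once reloading everything from source becomes slower than loading just the changed rows.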