How can we decide whether a particular table should be replicated or partitioned?
What is the use of replicated tables in GemFireXD database design?
All explanations are welcome.
A table should be REPLICATED in the following scenarios:
If it is a code table, i.e. a list of closely related items, each of which has minimal substructure.
If it is small in size.
If the application runs queries that join very large tables with small tables; in that case the smaller tables should be REPLICATED (see the DDL sketch below).
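For illustration, here is a minimal DDL sketch of the two options, using the REPLICATE and PARTITION BY clauses described in the docs linked below (table and column names are hypothetical):

    -- Small code/lookup table, read-mostly and joined everywhere:
    -- keep a full copy on every data store member.
    CREATE TABLE order_status_codes (
        status_code CHAR(2) PRIMARY KEY,
        description VARCHAR(50)
    ) REPLICATE;

    -- Large transactional table: spread the rows across the cluster,
    -- keeping one redundant copy of each bucket for availability.
    CREATE TABLE orders (
        order_id    BIGINT PRIMARY KEY,
        status_code CHAR(2),
        amount      DECIMAL(12,2)
    ) PARTITION BY PRIMARY KEY
      REDUNDANCY 1;

Joins between orders and order_status_codes can then be satisfied locally on each member, since the replicated code table is present everywhere.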
Please go through the following links:
http://gemfirexd.docs.pivotal.io/latest/userguide/index.html#data_management/replication-deciding.html
http://gemfirexd.docs.pivotal.io/latest/userguide/index.html#data_management/partitioning.html#concept_08031D9C1AEA48F1B4B021CA1BE3ABF5
I wanted to know the best approach for creating dimension (dim) tables. Should I maintain a single table with all fields and use them as required, or create separate dim tables and use them individually?
Can someone please help me out here?
PS: I'm a beginner here.
Creating one table per dimension is the best practice. In data warehousing, you will encounter four types of schema, listed below:
Star Schema
Snowflake Schema
Galaxy Schema
Combined Schema
People select one of the above based on the nature of their data, their requirements, and other parameters. But in every case there is a single table per dimension. This is easier to maintain and gives better performance.
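As a minimal sketch of the one-table-per-dimension idea in a star schema (all table and column names here are hypothetical):

    -- One table per dimension
    CREATE TABLE dim_date  (date_id  INT PRIMARY KEY, full_date  DATE, year INT);
    CREATE TABLE dim_store (store_id INT PRIMARY KEY, store_name VARCHAR(50));
    CREATE TABLE dim_item  (item_id  INT PRIMARY KEY, item_name  VARCHAR(50));

    -- The fact table references each dimension by its surrogate key
    CREATE TABLE fact_sales (
        date_id  INT REFERENCES dim_date (date_id),
        store_id INT REFERENCES dim_store (store_id),
        item_id  INT REFERENCES dim_item (item_id),
        quantity INT,
        amount   DECIMAL(12,2)
    );

Each dimension can then be filtered, maintained, and reused independently of the others.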
Well, I recently got into Amazon Redshift, trying to optimize the disk usage and performance of my database, and having read lots of AWS information on the topic, I still have some doubts.
First of all, my database structure: per schema, I have 3 master tables with 3 different IDs. These are now DISTSTYLE ALL tables, being small in size.
Each master table has a different number of IDs:
the date table --> largest one (#1 most joined)
the store table --> medium one (#3 most joined)
the item table --> smallest one (#2 most joined)
Then I have my core table, which holds the needed combinations of these IDs to display additional information about them. Based on my knowledge, this table should be DISTSTYLE KEY. Which of the 3 IDs should I select as my DISTKEY?
What are the criteria for this decision? I understand that for joins I need to look at the sort key; that has been defined on ID_date, because the date table is the most joined. So now, what about the per-node distribution of this table?
I'm sorry if I'm rambling; I don't want to leave any information out. If I have, feel free to ask! Thanks for taking the time to read!
You'll find the best advice in Amazon Redshift best practices for designing tables. It goes into quite a bit of detail.
However, my rule of thumb is:
The DISTKEY should be the column most used in JOINs between tables
The SORTKEY should be the column most used in WHERE statements
Use DISTSTYLE ALL for small lookup tables
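Applied to the layout described in the question, a hedged DDL sketch might look like this (column names are assumed from the question, not taken from a real schema):

    -- Small lookup/master table: copy it to every node
    CREATE TABLE dim_store (
        id_store   INT,
        store_name VARCHAR(50)
    ) DISTSTYLE ALL;

    -- Core table: distribute and sort on the most-joined/filtered ID
    CREATE TABLE core (
        id_date  INT,
        id_store INT,
        id_item  INT,
        amount   DECIMAL(12,2)
    )
    DISTKEY (id_date)   -- column most used in JOINs; implies DISTSTYLE KEY
    SORTKEY (id_date);  -- column most used in WHERE/range filters

Since the small master tables are DISTSTYLE ALL, joins to them are local on every node anyway; the DISTKEY pays off mainly when two large tables are joined on the same distribution column, and the SORTKEY lets Redshift skip blocks when queries filter on dates.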
I have a Power BI report that has a few different pages displaying different visuals. The report uses the same table of data (let's call it Jobs).
The previous author of this report created two queries in the data section that read from this base table but apply different transformations and filters to the underlying data. The visuals then use one or the other of these queries to display their data. For example, the first one applies a filter to exclude certain columns based on a status field, and the other applies a different filter and performs transformations on some of the columns.
When I manually refresh the report, it looks like the report is retrieving data for both of these queries, even though the base data is the same. Since the dataset is quite large, I am worried that this report has been built inefficiently, but I am not sure if there is a better way of doing this.
TL;DR: The Source and Navigation steps of both queries are exactly the same. Is this retrieving the data twice and making my report inefficient, and if so, what is the appropriate way to achieve what I am trying to do?
Power BI will try to parallelize as much as possible. If you have two queries that read from the same table, then two source queries will be executed.
To avoid this you can:
Create a query that gets only the necessary data from the table.
Set this query not to be loaded in the model (toggle off "Enable Load").
Make every other table that starts from this data reference this query rather than clone it.
This way, the data is fetched from the source once and then used to build the other tables in Power Query.
I have a simple data model, from the Contoso database, that looks like this:
I'm trying to set up the table named Online Sales Aggregate as an aggregate table. When I attempt to set up a mapping, all the detail tables are disabled (see below).
When I hover over a table I see a message that says, "Customers (for example) must be a DirectQuery table to be used as the detail table."
All the tables in the model, including the Online Sales Aggregate table, were imported. Why do the detail tables need to be DirectQuery tables?
This is currently a limitation that Microsoft has imposed, at least while aggregations are still in preview.
From Microsoft's documentation:
Detail table must be DirectQuery, not Import.
According to Microsoft people, it's likely that this limitation will eventually go away.
v-lili6-msft: Power bi product team is improving this preview feature
JoshCaplan-MSFT: This is still a work in progress but it is coming.
To expand on what David says below, I'd guess that removing this limitation is not a high priority, since the main use case for aggregations is datasets that are too unwieldy to import. If you've already imported all the data, then adding an aggregate table probably won't speed things up much in most cases.
If you still need an aggregate table for an imported table, you can use the workaround he describes: create a summarized table via the query editor or a DAX calculated table, and write your measure(s) to try to read from that first. An added bonus of this method is that you can use custom measures in your summarized table instead of being limited to the aggregate summarization functions (Count, GroupBy, Max, Min, Sum), though you'll need to be careful with how you handle non-additive measures.
I'm in the process of learning to pull the appropriate metadata from a Teradata database, and a large part of what I need is to pull all existing primary/foreign keys within a database. I am still very much a beginner with Teradata, as well as big data in general, so a simplified explanation would be nice.
A simplified version of a select statement would also be incredibly helpful. Thanks in advance.
Foreign Keys: dbc.All_RI_ParentsV[X]
PK/Unique: dbc.IndicesV[X]. Unique indexes have UniqueFlag = 'Y'; if the index was defined as a PK in the CREATE TABLE, its IndexType will be 'P'. Multi-column indexes have one row per column, all sharing the same IndexNumber; IndexNumber 1 is always the primary index (PI).
But as Teradata is typically used as a DWH, you might have tables without a defined PK, and you will hardly find any defined FKs.
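A simplified sketch of both lookups (the database name 'MyDatabase' is a placeholder):

    -- Primary keys / unique indexes: one row per indexed column
    SELECT DatabaseName, TableName, ColumnName,
           IndexNumber, IndexType, UniqueFlag
    FROM   dbc.IndicesV
    WHERE  DatabaseName = 'MyDatabase'
    AND    IndexType = 'P'          -- 'P' = defined as PRIMARY KEY
    ORDER BY TableName, IndexNumber, ColumnPosition;

    -- Foreign keys: one row per child-column/parent-column pair
    SELECT ChildDB, ChildTable, ChildKeyColumn,
           ParentDB, ParentTable, ParentKeyColumn
    FROM   dbc.All_RI_ParentsV
    WHERE  ChildDB = 'MyDatabase'
    ORDER BY ChildTable;

Drop the IndexType filter to see all indexes (UniqueFlag = 'Y' marks the unique ones), and use the [X] view variants if your site restricts dictionary access to rows you are allowed to see.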