Hash distribution on identity column - azure-sqldw

When creating a table in Azure SQL Data Warehouse, I would like to use hash distribution on an identity column, but I get the following error:
Cannot insert explicit value for identity column in table 'Table_ff4d8c5d544f4e26a31dbe71b44851cb_11' when IDENTITY_INSERT is set to OFF.
Is this not possible? And if not, why? And is there a work-around? (And where does this odd table name come from?)
Thanks!

You cannot use an IDENTITY column as the hash distributed column in your table.
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-identity#limitations
In SQL DW, the name you give to your table is its logical name, not its physical name. Logical metadata such as table names is maintained centrally on the control node, so that operations such as table renames are quick and painless. However, SQL DW is still bound by the rules of table creation: we need to make sure the table name is unique both now and in the future. Therefore the physical names contain GUIDs to deliver that uniqueness.
That said, the error you see here is not ideal. It would be helpful if you could post a repro so that we can improve the experience for you.
You are also welcome to post a feature request on our uservoice channel for hash distribution on the IDENTITY column. https://feedback.azure.com/forums/307516-sql-data-warehouse
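As a sketch of one common workaround (the table and column names below are made up for illustration): keep the IDENTITY column as a surrogate key, but hash-distribute on a different high-cardinality column instead.

```python
# Hypothetical workaround: keep the IDENTITY column as a surrogate key,
# but hash-distribute on another high-cardinality column instead.
# (Table/column names are made up for illustration.)
ddl = """
CREATE TABLE dbo.FactSale
(
    SaleKey    INT IDENTITY(1, 1) NOT NULL,  -- surrogate key, NOT the distribution column
    CustomerId INT NOT NULL,                 -- distribute on this column instead
    Amount     DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerId),
    CLUSTERED COLUMNSTORE INDEX
);
"""
print(ddl)
```

Joins on CustomerId would still be co-located, and the IDENTITY column keeps generating surrogate key values as rows are loaded.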

Related

Informatica Power Center - ERROR: "Target table [TABLE_NAME] has no keys specified."

Hi everyone,
I have a problem in Informatica PowerCenter.
In my mapping I have 5 objects:
1x Source Table
1x Source Qualifier
1x Expression Transformation
1x Update Strategy
1x Target Table
The source and target tables have no primary key, so why does Informatica PowerCenter expect a key?
I have tried changing the "Treat source rows as" property of my workflow session from "Insert" to "Data driven", and it works.
You have an Update Strategy in your mapping, which requires some key to be defined on the target. Informatica fires a query like:
UPDATE tgt SET col = ? WHERE key = ?
The first question mark is the updated column and the second is the key.
You can mark a unique column as the primary key.
If you don't have a primary or unique key in the target, define all columns except the updatable ones as keys.
Alternatively, you can use the target update override to write the SQL that updates the target, but there too you have to set a query similar to the one above.
"Treat source rows as" should be set to Data driven.
In Informatica, the ports marked as keys in the Target transformation indicate what should be used to build the UPDATE statement in the DB. It has nothing to do with the actual primary key defined in the database itself. Usually you use the same columns as keys in Informatica and in the DB, but this is not necessary: the DB is unaware of what is set in Informatica, and vice versa.
It's even perfectly valid to have the same database table defined multiple times in Informatica and have different mappings that update the data using different columns as keys.
Note however that if you use Update Strategy you have to define which columns to use as keys.
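To make the mechanics concrete, here is a small sketch (with made-up port names) of how the generated statement is assembled from the ports marked as keys versus the updatable ports:

```python
# Sketch of how the UPDATE statement is assembled from the Target
# transformation's ports (port/table names here are made up).
def build_update(table, key_ports, update_ports):
    # Updatable ports go into the SET clause, key ports into the WHERE clause.
    set_clause = ", ".join(f"{c} = ?" for c in update_ports)
    where_clause = " AND ".join(f"{c} = ?" for c in key_ports)
    return f"UPDATE {table} SET {set_clause} WHERE {where_clause}"

sql = build_update("tgt", key_ports=["cust_id"], update_ports=["status"])
print(sql)  # UPDATE tgt SET status = ? WHERE cust_id = ?
```

With no ports marked as keys there is nothing to put in the WHERE clause, which is why the session fails validation.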

Automatically generate data documentation in the Redshift cluster

I am trying to automatically generate data documentation in the Redshift cluster for all the maintained data products, but I am having trouble doing so.
Is there a way to fetch/store metadata about tables/columns in Redshift directly?
Is there also some automatic way to determine what are the unique keys in a Redshift table?
For example, an ideal solution would provide:
Table location (cluster, schema, etc.)
Table description (what is the table for)
Each column's description (what is each column for, data type, is it a key column, if so what type, etc.)
Column's distribution (min, max, median, mode, etc.)
Columns which together form a unique entry in the table
I fully understand that getting the descriptions automatically is pretty much impossible, but I couldn't find a way to store the descriptions in Redshift directly; instead I'd have to use third-party solutions or keep the documentation outside of the SQL scripts, which I'm not a big fan of, given the way the data products are built right now. So having a way to store each table's/column's description in Redshift would be greatly appreciated.
Amazon Redshift has the ability to store a COMMENT on:
TABLE
COLUMN
CONSTRAINT
DATABASE
VIEW
You can use these comments to store descriptions. Reading them back might need a bit of joining against the system catalog tables.
See: COMMENT - Amazon Redshift
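For example (the object names here are made up), comments are set with plain SQL and can be read back by joining the `pg_description` system catalog against `pg_class` and `pg_attribute`:

```python
# Hypothetical example: store descriptions with COMMENT (names are made up).
set_comments = """
COMMENT ON TABLE  sales            IS 'Daily sales facts, one row per order line';
COMMENT ON COLUMN sales.amount_usd IS 'Net amount in USD';
"""

# Comments land in the pg_description catalog; objsubid = 0 means the comment
# is on the table itself, otherwise it is the column's ordinal position.
read_comments = """
SELECT c.relname AS table_name,
       a.attname AS column_name,
       d.description
FROM pg_description d
JOIN pg_class c          ON c.oid = d.objoid
LEFT JOIN pg_attribute a ON a.attrelid = d.objoid AND a.attnum = d.objsubid
WHERE c.relname = 'sales';
"""
print(set_comments + read_comments)
```

A documentation generator can run the second query across all schemas and emit the results in whatever format you need.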

Dist/Sort key for Redshift time series database

I am involved in a time-series telemetry project where we store data in Amazon Redshift. We have a timestamp column for collection time, and ClientID and IoT-ID columns identifying a unique IoT device within a client.
All our queries are time-bound, in the sense that we query for a particular day/week/month. Would the following be a good dist/sort key?
Distribution key - (Clientid, IOT-ID)
Sort key - timestamp
The general rule for Amazon Redshift is:
Set the Distribution Key to the field normally used to JOIN with other tables. This will put all data for a given value of that column on the same slice, making it easier to JOIN with other tables that have the same DISTKEY.
Set the Sort Key to the field that is most commonly used in a WHERE statement. Rows will be stored in order of this field, making it easier to "skip over" disk blocks that do not contain the desired data. (This is very powerful.)
So, it sounds like your timestamp field is ideal as the SORTKEY.
The choice of DISTKEY depends on how you JOIN, but can also help GROUP BY since the relevant data is co-located.
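As a sketch of the resulting table definition (column names are assumptions): note that Redshift's table-level DISTKEY accepts a single column, so you can't distribute on (ClientID, IOT-ID) directly; you'd pick one, or distribute on a pre-concatenated column.

```python
# Sketch of a time-series table with the suggested keys (names made up).
# Redshift's DISTKEY takes one column, so client_id is used on its own here.
ddl = """
CREATE TABLE telemetry
(
    client_id    INT       NOT NULL,
    iot_id       INT       NOT NULL,
    collected_at TIMESTAMP NOT NULL,
    reading      DOUBLE PRECISION
)
DISTKEY (client_id)
SORTKEY (collected_at);
"""
print(ddl)
```

With `collected_at` as the sort key, a day/week/month predicate lets Redshift skip the disk blocks whose min/max timestamps fall outside the range.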

How do I avoid hot partitions when using DynamoDB row-level access control?

I’m looking at adding row-level permissions to a DynamoDB table using dynamodb:LeadingKeys to restrict access per Provider ID. Currently I only have one provider ID, but I know I will have more. However, the providers will vary in size, with those sizes being very unbalanced.
If I use Provider ID as my partition key, it seems to me like my DB will end up with very hot partitions for the large providers and mostly unused ones for the smaller providers. Prior to adding the row-level access control I was using deviceId as the partition key, since it is a more random value and so partitions well, but now I think I have to move it to the sort key.
Current partitioning that works well:
HASHKEY: DeviceId
With permissions I think I need to go to:
HASHKEY: ProviderID (only a handful of them)
RangeKey: DeviceId
Any suggestions as to a better way to set this up?
In general, you no longer need to worry about hot partitions in DynamoDB, especially if the partition keys which are being requested the most remain relatively constant.
More Info: https://aws.amazon.com/blogs/database/how-amazon-dynamodb-adaptive-capacity-accommodates-uneven-data-access-patterns-or-why-what-you-know-about-dynamodb-might-be-outdated/
Expanding on Michael's comment...
If you don't need a range key now...why add one?
The only reason to have a range key is that you need to Query DDB and return multiple records.
If all you ever need is a single record using GetItem, then you don't need a range key.
Simply concatenate ${ProviderId}.${DeviceId} together to make up your hash key.
Edit
Since you want to be able to list device IDs for a single provider, you do need providerID as the partition key and deviceID as the range key.
As Icehorn's answer mentions, "hot partitions" aren't as big a deal as they used to be. Unless you expect the data for a single providerID to go over 10GB, I'd start with the simple implementation of hashKey(providerID).
If you expect more than 10GB of data, or you end up with a hot partition, then consider concatenating an integer (1..n) to the providerID.
This will mean that you'd have to query multiple partitions to get all the deviceIDs.
This approach is detailed in Multi Tenant SaaS Storage Strategies
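A small sketch of the two key layouts discussed above (attribute formats and names are made up for illustration):

```python
# Sketch of the two key layouts discussed above (names are made up).

# 1. Single-record access only (GetItem): concatenate provider and device
#    into one hash key, no range key needed.
def composite_key(provider_id, device_id):
    return f"{provider_id}.{device_id}"

# 2. Write sharding for a very large provider: spread items over n partitions
#    by appending a shard suffix. Listing all devices for a provider then
#    means querying every shard key and merging the results.
def shard_keys(provider_id, n_shards):
    return [f"{provider_id}#{i}" for i in range(1, n_shards + 1)]

print(composite_key("prov1", "dev42"))  # prov1.dev42
print(shard_keys("prov1", 3))           # ['prov1#1', 'prov1#2', 'prov1#3']
```

The trade-off is exactly as described: sharding spreads throughput, but reads that span a provider fan out over all of its shard keys.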

Can you add a global secondary index to dynamodb after table has been created?

With an existing dynamodb table, is it possible to modify the table to add a global secondary index? From the dynamodb control panel, it looks like I have to delete the table and create a new one with the global index.
Edit (January 2015):
Yes, you can add a global secondary index to a DynamoDB table after its creation; see here, under "Global Secondary Indexes on the Fly".
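For illustration, a sketch of what such a request looks like with boto3 (the table, index, and attribute names are made up; sending it requires a real `boto3.client('dynamodb')` and the arguments are passed as `client.update_table(**request)`):

```python
# Sketch of an online GSI creation request for UpdateTable
# (table/index/attribute names are hypothetical).
request = {
    "TableName": "my-table",
    # Any new key attribute used by the index must be declared here.
    "AttributeDefinitions": [
        {"AttributeName": "status", "AttributeType": "S"},
    ],
    "GlobalSecondaryIndexUpdates": [
        {
            "Create": {
                "IndexName": "status-index",
                "KeySchema": [{"AttributeName": "status", "KeyType": "HASH"}],
                "Projection": {"ProjectionType": "ALL"},
                "ProvisionedThroughput": {
                    "ReadCapacityUnits": 5,
                    "WriteCapacityUnits": 5,
                },
            }
        }
    ],
}
print(request["GlobalSecondaryIndexUpdates"][0]["Create"]["IndexName"])
```

The table stays online while the index backfills; you can watch the index's `IndexStatus` via DescribeTable until it reaches ACTIVE.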
Old Answer (no longer strictly correct):
No, the hash key, range key, and indexes of the table cannot be modified after the table has been created. You can easily add elements that are not hash keys, range keys, or indexed elements after table creation, though.
From the UpdateTable API docs:
You cannot add, modify or delete indexes using UpdateTable. Indexes can only be defined at table creation time.
To the extent possible, you should really try to anticipate current and future query requirements and design the table and indexes accordingly.
You could always migrate the data to a new table if need be.
Just got an email from Amazon:
Dear Amazon DynamoDB Customer,
Global Secondary Indexes (GSI) enable you to perform more efficient
queries. Now, you can add or delete GSIs from your table at any time,
instead of just during table creation. GSIs can be added via the
DynamoDB console or a simple API call. While the GSI is being added or
deleted, the DynamoDB table can still handle live traffic and provide
continuous service at the provisioned throughput level. To learn more
about Online Indexing, please read our blog or visit the documentation
page for more technical and operational details.
If you have any questions or feedback about Online Indexing, please
email us.
Sincerely, The Amazon DynamoDB Team
According to the latest news from AWS, GSI support for existing tables will be added soon.
Official statement on AWS forum