Automatically generate data documentation in the Redshift cluster - amazon-web-services

I am trying to automatically generate a data documentation in the Redshift cluster for all the maintained data products, but I am having trouble to do so.
Is there a way to fetch/store metadata about tables/columns in redshift directly?
Is there also some automatic way to determine what are the unique keys in a Redshift table?
For example an ideal solution would be to have:
Table location (cluster, schema, etc.)
Table description (what is the table for)
Each column's description (what is each column for, data type, is it a key column, if so what type, etc.)
Column's distribution (min, max, median, mode, etc.)
Columns which together form a unique entry in the table
I fully understand that getting the descriptions automatically is pretty much impossible, but I couldn't find a way to store the descriptions in redshift directly, instead I'd have to use 3rd party solutions or generally a documentation outside of the SQL scripts, which I'm not a big fan of, due to the way the data products are built right now. Thus having a way to store each table's/column's description in redshift would be greatly appreciated.

Amazon Redshift has the ability to store a COMMENT on:
TABLE
COLUMN
CONSTRAINT
DATABASE
VIEW
You can use these comments to store descriptions. It might need a bit of table joining to access.
See: COMMENT - Amazon Redshift

Related

How to add columns to an existing Athena table using Avro storage

I have an existing Athena table (w/ hive-style partitions) that's using the Avro SerDe. When I first created the table, I declared the Athena schema as well as the Athena avro.schema.literal schema per AWS instructions. Everything has been working great.
I now wish to add new columns that will apply going forward but not be present on the old partitions. I tried a basic ADD COLUMNS command that claims to succeed but has no impact on SHOW CREATE TABLE. I then wondered if I needed to change the Avro schema declaration as well, which I attempted to do but discovered that ALTER TABLE SET SERDEPROPERTIES DDL is not supported in Athena.
AWS claims I should be able to add columns when using Avro, but at this point I'm unsure how to do it. Even if I'm willing to drop the table metadata and redeclare all of the partitions, I'm not sure how to do it right since the schema is different on the historical partitions.
Looking for high-level guidance on the steps to be taken. Documentation is scant and Athena seems to be lacking support for commands that are referenced in this same scenario in vanilla Hive world. Thanks for any insights.

Perform data mapping in GCP

I have data coming from multiple hotels. These hotels are not using the same naming convention for storing the order information. I have a predefined dataset created in the bigquery(called hotel_order). I
want to map the data coming from different hotels to the single dataset in GCP, so it is easier for me to do comparisons in the bigquery.
If the column name(from hotel1) matches the bigquery dataset columnname, then the bigquery should load the data in the column, if the columnnames (from hotel orders data and dataset in bigquery) don't match, then column in the bigquery should have the null value. How do I do implement this in GCP? Problem of mapping in the GCP?
If you want to join tables together, and show a null value when a match doesn't exist, then you can do so using 'left join'.
Rough example
from hotel.orders as main left join hotel_number_one as Hotel_One on main.order_information = Hotel_One.order_information
Difficult to give a more detailed answer without more details or a working example using dbfiddle.

AWS Glue table Map data type for arbitratry number of fields and challenges faced

We are working on a Data-Lake project and we are getting the data from the cloudwatch logs, which in turn is going to be sent to S3 through the help of Kinesis service. Once this is in S3, we need to create a table to see the data through Athena, we have JSON files. We have 3 fields one is timestamp and the other one is properties, which in turn is an object which may hold arbitrary number of fields and differs on case to case, hence while creating the table, I defined it to be map<string,string>, based on some research and advises. Now the challenge is while querying through Athena, it always says zero records returned when there is data for sure. To confirm this, I have create a table this time through crawler and I am able to see the data through Athena, but only difference is the properties column is defined as struct with specific fields inside it, but where manual table has map<string,string> to handle arbitrary fileds coming in. Appreciate for any help to identify the root cause against this. Thank you.
Below is sample JSON line which is sitting in S3.
{"streamedAt":1599012411567,"properties":{"timestamp":1597998038615,"message":"Example Event 1","processFlag":"true"},"event":"aws:kinesis:record"}
Zero records returned usually means the table's location is wrong, or that you have a partitioned table but haven't added any partitions. I assume you would have figured it out if it was the former, so I suspect the latter. When you run a crawler it will add partitions it finds in addition to creating the table.
If you need help adding partitions please edit your question and provide examples of how your data is structured on S3.
This is a case where using a Glue crawler will probably not work very well, it will try too hard to figure out the schema of the properties column and it's never going to get it right. Glue crawlers are in general pretty bad at things that aren't very basic (see this question for a similar use case to yours when Glue didn't work out).
I think you'll be fine with a manually created table that uses map<string,string> as the type for the properties column. When you know the type of a property and want to use it as that type you just cast the value at query time. An alternative is to use string as the type and use the JSON functions to extract values at query time.

Columnar database queries in Amazon Redshift

I'm learning Amazon Redshift. Heard that it is very powerful storage on cloud and works very fast on data where aggregate operations are required because it stores data column-wise.
Am not able to find any example queries? Could someone share with me some examples of Aggregate queries running on Amazon Redshift? Is it different from normal relation database queries?
You are correct -- Amazon Redshift is a columnar database. This means that data is stored on disk per column, making operations on a column very fast. For example, adding the Sales column for a particular value in the Country column only requires accessing two columns rather than all columns in a table.
Other benefits are that data in Redshift is compressed (which works well with the columnar concept, because each column uses its own compression method based on the data stored) and the fact that it is a clustered database, so compute and storage can be scaled by adding additional nodes.
Amazon Redshift presents itself as a PostgreSQL database, so you just use industry-standard SQL to query data. No changes to queries are required.
However, you can optimize Redshift by wisely choosing a Distribution Key for each table that determines how data is distributed amongst nodes, and carefully select the Sort Key, which determines how data is stored on each node. Put simply, data should be distributed by how you JOIN tables and should be sorted by what you use in WHERE statements.
As for sample queries... it totally depends upon your data! Queries look exactly the same as normal SQL.

Amazon Redshift schema design

We are looking at Amazon Redshift to implement our Data Warehouse and I would like some suggestions on how to properly design Schemas in Redshift, please.
I am completely new to Redshift. In the past when I worked with "traditional" data warehouses, I was used to creating schemas such as "Source", "Stage", "Final", etc. to group all the database objects according to what stage the data was in.
By default, a database in Redshift has a single schema, which is named PUBLIC. So, my question to those who have worked with Redshift, does the approach that I have outlined above apply here? If not, I would love some suggestions.
Thanks.
With my experience in working with Redshift, I can assert the following points with confidence:
Multiple schema: You should create multiple schema and create tables accordingly. When you'll scale, it'll be easier for you to pin-point where exactly the table is supposed to be. Let us say, you have 3 schema, named production, aggregates and rough. Now, you know that the table production will contain the tables that are not supposed to be changed (mostly OLTP data) - such as user, order, transactions tables. Table aggregates will have aggregated data built over raw tables - such as number of orders placed per user per day per category. Finally, rough will contain any table that doesn't hold a business logic but is required for some temporary work - let us say to check the genre of movies for a list of 1 lakh users, which is shared with you in an excel file. Simply create a table in rough schema, perform your operations and drop the table. Now you very clearly know where you'll find the tables based on whether they are raw, aggregated or simply temporary tables.
Public schema: Forget it exists. Any table that is not preceded with a schema name, gets created there. A lot of clutter - no point in storing any important data there.
Cross schema joins: There's no stopping here. You may join as many tables from as many schema as required. In fact, it is desirable you create dimension tables and join on a PK later, rather than to keep all the information in a single table.
Spend some quality time in designing the schema and underlying table structure. When you expand, it'll be easier for you to classify things better in terms of access control. Do let me know if I've missed some obvious points.
You can have multiple databases in a Redshift cluster but I would stick with one. You are correct that schemas (essentially namespaces) are a good way to divide things up. You can query across schemas but not databases.
I would avoid using the public schema as managing certain permissions there can be difficult (easier to deny someone access to public than prevent them from being able to create a table for example).
For best results if you have the time, learn about the permissions system up front. You want to create groups that have access to schemas or tables and add/remove users from groups to control what they can do. Once you have that going it becomes pretty easy to manage.
In addition to the other responses, here are some suggestions for improving schema performance.
First: Automatic compression encodings using COPY command
Improve the performance of Amazon Redshift using the COPY command. It will get data into Redshift database. The COPY command is clever enough. It automatically chooses the most appropriate encoding settings for the data it uploads. You don’t have to think about it. However, it does so only for the first data upload into an empty table.
So, make sure to use a significant data set while uploading data for the first time, which Redshift can assess to set the column encodings in the best way. Uploading a few lines of test data will confuse Redshift to know how best to optimize the compression to handle the real workload.
Second: Use Best Distribution Style and Key
Distribution-style decides how data is distributed across the nodes. Applying a distribution style at table level tells Redshift how you want to distribute the table and the key. So, how you specify distribution style is important for good query performance with Redshift. The style you choose may affect requirements for data storage and cluster. It also affects the time taken by the COPY command to execute.
I recommend setting the distribution style to all tables with a smaller dimension. For large dimension, distribute both the dimension and associated fact on their join column. To optimize the second large dimension, take the storage-hit and distribute ALL. You can even design the dimension columns into the fact.
Third: Use the Best Sort Key
A Redshift database maintains data in a table with an arrangement of a sort-key-column if specified. Since it’s sorted in each partition; each cluster node upholds its partition in predefined order. (While designing your Redshift schema, also consider the impact on your budget. Redshift is priced by amount of stored data and by the number of nodes.)
Sort key optimizes Amazon Redshift performance significantly. You can do it in many ways. First, use data filtering. If where-clause filters on a sort-key-column, it skips the entire data blocks. It’s because Redshift saves data in blocks. Each block header records the minimum and maximum sort key value. Filter outside of that range, the entire block may get skipped.
Alternatively, when joining two tables, sorted on their joint keys, the data is read in matching order. Also, you can merge-join without separate sort-steps. Joining large dimension to a large fact table will be easy with this method because neither will fit into a hash table.