Athena equivalent to information_schema - amazon-athena

I come from a SQL Server background and make heavy use of the system tables and information_schema to tell me all about my tables and columns.
I didn't expect the exact same power in Athena, but I'm currently shocked and frustrated by how little seems to be available, unless I've missed something?
For example, 'describe mytable' only describes one table at a time.
How about showing the columns for ALL tables in one result?
It also does not output the table name, nor allow you to manually add that in as a custom column.
All the results of these "show/list/describe" commands seem to produce a text list - not a recordset, so you cannot take the results and join them to other tables or views to make more complex outputs.
Is there any other way to query the contents of my databases?
Thanks in advance

Athena is based on Presto. Presto provides an information_schema schema, and I checked that it is accessible in Athena.
For example, you can run a query like:
SELECT * FROM information_schema.columns;
to get a list of columns of all tables.
You can filter this by "database":
SELECT * FROM information_schema.columns WHERE table_schema = '<databasename>';
Note however that these types of queries are not necessarily very performant.
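On the "join them to other tables" part of the question: because these are ordinary tables, their rows can be filtered, joined and aggregated like any others. Below is a minimal sketch, assuming a database called my_database (a placeholder, not a name from the question) and the usual Presto information_schema column names (table_schema, table_name, table_type, column_name, data_type, ordinal_position):

```sql
-- Every column of every table in the hypothetical database my_database,
-- with the table name included in each row.
SELECT table_name, column_name, data_type, ordinal_position
FROM information_schema.columns
WHERE table_schema = 'my_database'
ORDER BY table_name, ordinal_position;

-- One row per table, joined to its columns to get a column count.
SELECT t.table_schema,
       t.table_name,
       t.table_type,
       count(c.column_name) AS column_count
FROM information_schema.tables AS t
LEFT JOIN information_schema.columns AS c
       ON c.table_schema = t.table_schema
      AND c.table_name = t.table_name
WHERE t.table_schema = 'my_database'
GROUP BY t.table_schema, t.table_name, t.table_type
ORDER BY t.table_name;
```

Filtering on table_schema keeps the amount of metadata scanned down, which should help with the performance caveat above.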

Does the syntax in a Power BI join cause a data refresh?

I'm trying to speed up a Power BI report that someone else created. Going through the queries, I've noticed that some of the merged queries use different syntax, and I'm wondering whether the different syntax causes a data refresh during the merge.
Below are 2 different merged queries, but one has the # sign before the table name with the table name in quotes and the other does not. What is the significance of not having the # sign?
It's the #"Org_Roll-Up" vs Account_Groups.
Syntax 1
= Table.NestedJoin(#"Changed Type9", {"COMPANY"}, #"Org_Roll-Up", {"ORG"}, "Org_Roll-Up", JoinKind.LeftOuter)
Syntax 2
= Table.NestedJoin(#"Removed Columns", {"ACCOUNT"}, Account_Groups, {"ACCOUNT"}, "Account_Groups", JoinKind.LeftOuter)
I'm trying to get the queries to run once and then send the data to other queries as needed instead of refreshing each time. I have parallel turned off and background data refresh off.
The # syntax makes zero difference. Variable names need the # and quotes when they contain spaces or special characters (the hyphen in Org_Roll-Up is why that name is quoted); otherwise they are not required. See here for more details:
https://bengribaudo.com/blog/2018/01/19/4321/power-query-m-primer-part4-variables-identifiers
The way you’ve worded this, I just want to make sure you understand how Power Query does a merge. When you merge query A with query B, Power Query will run query B again for use by query A. It doesn’t pull data previously loaded by query B or table B from the data model (nor does it change table B). It will run query B again, join it with query A, and load the result to table A. So in syntax 1, #"Org_Roll-Up" will re-run, and in syntax 2, Account_Groups will re-run. Depending on what the query does and how many rows are in the table, small changes can make quite a difference to performance. See Chris Webb’s three-part series here for more ideas: https://blog.crossjoin.co.uk/2020/05/31/optimising-the-performance-of-power-query-merges-in-power-bi-part-1/

Create a table using `pg_table_def` data in Redshift or DBT

To create a table from all the data in pg_table_def that is visible to my user, I tried:
create table adhoc_schema.pg_table_dump as (
select *
from pg_table_def
);
But it throws an error:
Column "schemaname" has unsupported type "name"
Any way to create a table from pg_table_def or information_schema.columns?
Found this from another thread, which seems like it would help.
Amazon considers the internal functions that INFORMATION_SCHEMA.COLUMNS is using Leader-Node Only functions. Rather than being sensible and redefining the standardized INFORMATION_SCHEMA.COLUMNS, Amazon sought to define their own proprietary version. For that they made available another function PG_TABLE_DEF which seems to address the same need. Pay attention to the note in the center about adding the schema to search_path.
Stores information about table columns.
PG_TABLE_DEF only returns information about tables that are visible to the user. If PG_TABLE_DEF does not return the expected results, verify that the search_path parameter is set correctly to include the relevant schemas.
You can use SVV_TABLE_INFO to view more comprehensive information about a table, including data distribution skew, key distribution skew, table size, and statistics.
So using your example code (rewritten to use NOT EXISTS for clarity),
SET SEARCH_PATH to '$user', 'public', 'target_schema';
SELECT "column"
FROM dev.fields f
WHERE NOT EXISTS (
    SELECT 1
    FROM PG_TABLE_DEF pgtd
    WHERE pgtd.column = f.field
    AND schemaname = 'target_schema'
);
See also,
Official docs on Querying Redshift System Tables: https://docs.aws.amazon.com/redshift/latest/dg/t_querying_redshift_system_tables.html
pg_table_def is a leader-node system table with some additional data types not supported in user tables. You will need to cast to text first.
However, this still won't work, because pg_table_def is a leader-node table and the table you are creating is a user table stored on the compute nodes. You will need to pull your source information from tables that are stored on the compute nodes. Since I don't know what information you are looking for from pg_table_def, I cannot say exactly which ones you need, but you can start with stv_tbl_perm and join in pg_class and other tables as more info is needed.
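As a sketch of the "cast to text first" step: the query below reads pg_table_def with its name-typed columns cast to text. The schema name target_schema is carried over from the quoted example, and the choice of columns and aliases is my assumption about what is wanted. Per the caveat above, treat this as a plain SELECT; wrapping it in CREATE TABLE ... AS may still be rejected.

```sql
-- pg_table_def only lists tables in schemas that are on the search_path.
SET search_path TO '$user', public, target_schema;

-- Cast the name-typed columns to text so the result uses ordinary types.
-- Intended as a plain SELECT, not as the body of a CREATE TABLE ... AS.
SELECT schemaname::text AS schema_name,
       tablename::text  AS table_name,
       "column"::text   AS column_name,
       type             AS column_type
FROM pg_table_def
WHERE schemaname = 'target_schema';
```

The column named column has to be double-quoted because it is a reserved word.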

Using merge in Power Query while keeping native query

I'm trying to reduce my dataset of 1.000.000 records to only the subset I need (+/- 500) by creating an Inner Join to a different table. Unfortunately, it seems that Power Query drops the "native query" and loads the entire dataset before reducing it by merging it with a related table. I have no access to the database, otherwise I would have written the SQL myself. Is there a way to make merge work with a native SQL query?
Thanks
I would first check that your "related table" query can run as a native query: right-click on its last step and check if View Native Query is enabled.
If that's the case, then it may be due to the Join Kind in the Merge Queries step. I've noticed that against SQL Server data sources, Join Kinds other than the default Left Outer Join tend to kill the Native Query option.

In Power Query language (M language), how can we add custom "Value" and "Table" columns to a table manually?

In Power Query, if we get data from a SQL database, "Value" and "Table" columns are created automatically if there are relationships in the database.
AFAIK "Table" and "Value" mean one-to-many and many-to-one relationships respectively.
My problem is that there are no relationships in our database, so Power Query cannot generate these columns automatically. How can I manually add these columns if I know the relationships between the subject tables?
I found the Table.NestedJoin function, which returns a Table object (but with lower performance than the columns generated automatically when relationships exist in the database).
But I could not find any function which returns a Value object (a record of another table).
Possible other solutions, with their flaws, are:
You may advise that I load the tables as they are in the database and create relationships in the Relationships section in Power BI (or in the Power Pivot section in Excel). But I need this Value object in Power Query, because I would like to filter the rows according to the related table before loading all the rows of the table.
Creating a native query which joins the tables, which is not my preference.
Creating a Table object instead of a Value object (we are sure that only one record will come). I still have a performance problem with the Table.NestedJoin method. Is there another option?
Thanks in advance...
Just today I had much the same performance issue, but finally solved it. In my solution I work with views, but need to filter the incoming records.
When I use code like this:
let
    filter1 = 2016,
    filter2 = "SomeText",
    // The navigation key is a record, hence the [ ] inside the { }.
    tbl = Sql.Database("MyServer", "MyDB"){[Schema="dbo", Item="MyTableOrView"]}[Data],
    filteredTable = Table.SelectRows(tbl, each ([field1] = filter1) and ([field2] = filter2))
in
    filteredTable
it runs slowly. But if I use NestedJoin, it performs much better.
let
    // One-row table of filter values; the column names are passed as a list.
    Source = Table.FromColumns({{2016}, {"SomeText"}}, {"filter1", "filter2"}),
    tbl = Sql.Database("MyServer", "MyDB"){[Schema="dbo", Item="MyTableOrView"]}[Data],
    filteredTable = Table.NestedJoin(tbl, {"field1", "field2"}, Source, {"filter1", "filter2"}, "NewColumn", JoinKind.Inner)
in
    filteredTable
However, I noticed that even the fastest design I got runs slower than a plain query that returns all ~1300 rows from the view.
I have no SQL Profiler to track down exactly what is sent to the server, but it seems to me that query folding works when you use inner joins.
Try the following: make 2 queries to 2 tables (no other actions!) and inner join them, then see if it works faster.

Power Query - Select Columns from table instead of removing afterwards

The default behaviour when importing data from a database table (such as SQL Server) is to bring in all columns and then select which columns you would like to remove.
Is there a way to do the reverse, i.e. select which columns you want from a table? Preferably without using a Native SQL solution.
M:
let
    db = Sql.Databases("sqlserver.database.url"){[Name="DatabaseName"]}[Data],
    Sales_vDimCustomer = db{[Schema="Sales",Item="vDimCustomer"]}[Data],
    remove_columns = Table.RemoveColumns(Sales_vDimCustomer,{"Key", "Code","Column1","Column2","Column3","Column4","Column5","Column6","Column7","Column8","Column9","Column10"})
in
    remove_columns
The snippet above shows the connection and subsequent removal.
Compared to the native SQL way:
= Sql.Database("sqlserver.database.url", "DatabaseName", [Query="
SELECT Name,
Representative,
Status,
DateLastModified,
UserLastModified,
ExtractionDate
FROM Sales.vDimCustomer
"])
I can't see much documentation on the }[Data], value in the step, so I was hoping that I could hijack that field to specify which fields I want from that data.
Any ideas would be great! :)
My first concern is that when this gets compiled down to SQL, it gets sent as two queries (as watched in ExpressProfiler).
The first query removes the selected columns and the second selects all columns.
My second concern is that if a column is added to or removed from the database then it could crash my report (additional columns in Excel Tables jump your structured table language formulas to the wrong column). This is not a problem using Native SQL as it just won't select the new column and would actually crash if the column was removed which is something I would want to know about.
Ouch that was actually easy after I had another think and a look at the docs.
let
    db = Sql.Databases("sqlserver.database.url"){[Name="DatabaseName"]}[Data],
    // Select only the wanted columns directly on the navigated table.
    Sales_vDimCustomer = Table.SelectColumns(
        db{[Schema="Sales",Item="vDimCustomer"]}[Data],
        {
            "Name",
            "Representative",
            "Status",
            "DateLastModified",
            "UserLastModified",
            "ExtractionDate"
        }
    )
in
    Sales_vDimCustomer
This also loaded much faster than the other way and only generated one SQL request instead of two.