BQ Get labels from information schema - google-cloud-platform

I need to get the labels of all the BQ tables in a project.
Currently, the only way I have found is to loop over all the tables and retrieve the labels:
tables = client.list_tables(dataset_id)
for table in tables:
    if table.labels:
        for label, value in table.labels.items():
            print(table.table_id, label, value)
This approach works but is time-consuming.
Is there any way to get the labels with a single BQ query?
INFORMATION_SCHEMA.TABLES doesn't return the labels.

Labels are exposed as an option in the INFORMATION_SCHEMA views. For example, dataset-level labels can be read from SCHEMATA_OPTIONS:
SELECT *
FROM INFORMATION_SCHEMA.SCHEMATA_OPTIONS
WHERE schema_name = 'schema'
  AND option_name = 'labels';
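Since the question asks for table-level labels, note that those are exposed the same way in each dataset's INFORMATION_SCHEMA.TABLE_OPTIONS view (option_name = 'labels'). A minimal sketch with the Python client, using a placeholder dataset name:
from google.cloud import bigquery

client = bigquery.Client()

# 'my_dataset' is a placeholder; TABLE_OPTIONS is scoped to one dataset,
# so run this per dataset (a region-qualified view may cover more at once).
query = """
    SELECT table_name, option_value AS labels
    FROM my_dataset.INFORMATION_SCHEMA.TABLE_OPTIONS
    WHERE option_name = 'labels'
"""

for row in client.query(query).result():
    print(row.table_name, row.labels)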


Query for listing Datasets and Number of tables in Bigquery

So I'd like to make a query that shows all the datasets in a project and the number of tables in each one. My problem is with the number of tables.
Here is what I'm stuck with:
SELECT
smt.catalog_name as `Project`,
smt.schema_name as `DataSet`,
( SELECT
COUNT(*)
FROM ***DataSet***.INFORMATION_SCHEMA.TABLES
) as `nbTable`,
smt.creation_time,
smt.location
FROM
INFORMATION_SCHEMA.SCHEMATA smt
ORDER BY DataSet
The view INFORMATION_SCHEMA.SCHEMATA lists all the datasets of the project the query is executed in, and the view INFORMATION_SCHEMA.TABLES lists all the tables of a given dataset.
The thing is that INFORMATION_SCHEMA.TABLES needs the dataset to be specified in order to return the table information: dataset.INFORMATION_SCHEMA.TABLES
So what I need is to replace the ***DataSet*** placeholder with the value I get from the query itself (smt.schema_name).
I am not sure whether this can be done with a subquery; I don't really know how to go about it.
I hope I'm clear enough, thanks in advance if you can help.
You can do this with BigQuery's procedural language (scripting) as follows:
CREATE TEMP TABLE table_counts (dataset_id STRING, table_count INT64);

FOR record IN (
  SELECT
    catalog_name AS project_id,
    schema_name AS dataset_id
  FROM `elzagales.INFORMATION_SCHEMA.SCHEMATA`
)
DO
  EXECUTE IMMEDIATE CONCAT(
    "INSERT table_counts (dataset_id, table_count) SELECT table_schema AS dataset_id, COUNT(table_name) FROM ",
    record.dataset_id, ".INFORMATION_SCHEMA.TABLES GROUP BY dataset_id");
END FOR;

SELECT * FROM table_counts;
This will return one row per dataset with its table count.
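If you'd rather not use scripting, a client-side sketch with the Python API does the same thing by looping over datasets (one list call per dataset, so it is slower for projects with many datasets):
from google.cloud import bigquery

client = bigquery.Client()

# One line per dataset: project, dataset ID, number of tables.
for dataset in client.list_datasets():
    table_count = sum(1 for _ in client.list_tables(dataset))
    print(dataset.project, dataset.dataset_id, table_count)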

AWS Glue dynamic frame - no column headers if no data

I read the Glue catalog table, convert it to a DataFrame, and print the schema using the code below (Spark with Python):
dyf = glueContext.create_dynamic_frame.from_catalog(database='database_name',
                                                    table_name='table_name',
                                                    redshift_tmp_dir=args['TempDir'])
df = dyf.toDF()
df.printSchema()
It works fine when the table has data.
But it doesn't print the schema if the table is empty (it is unable to get the schema of an empty table). As a result, subsequent joins fail.
Is there a way to overcome this and make the dynamic frame get the table schema from the catalog even for an empty table, or is there any other alternative?
I found a solution. It is not ideal but it works. If you call apply_mapping() on your DynamicFrame, it will preserve the schema in the DataFrame. For example, if your table has column last_name, you can do:
dyf = glueContext.create_dynamic_frame.from_catalog(database='database_name',
                                                    table_name='table_name',
                                                    redshift_tmp_dir=args['TempDir'])
df = dyf.apply_mapping([
    ("last_name", "string", "last_name", "string")
]).toDF()
df.printSchema()
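If you don't want to hard-code the mapping, one possible variation (a sketch, assuming the column types stored in the Glue Data Catalog are valid apply_mapping types and partition keys are not needed) is to build the mapping list from the catalog with boto3:
import boto3

# Same database/table names as passed to from_catalog above.
glue = boto3.client('glue')
catalog_table = glue.get_table(DatabaseName='database_name', Name='table_name')
columns = catalog_table['Table']['StorageDescriptor']['Columns']

# One ("name", "type", "name", "type") tuple per catalog column.
mappings = [(col['Name'], col['Type'], col['Name'], col['Type']) for col in columns]

df = dyf.apply_mapping(mappings).toDF()
df.printSchema()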

Pyspark with AWS Glue join on multiple columns creating duplicates

I have two tables in AWS Glue, table_1 and table_2 that have almost identical schemas, however, table_2 has two additional columns. I am trying to join these two tables together on the columns that are the same and add the columns that are unique to table_2 with null values for the "old" data whose schema does not include those values.
Currently, I am able to join the two tables, using something similar to:
joined_table = Join.apply(table_1, table_2, 'id', 'id')
where the first 'id' is the id column in table_1 and the second 'id' is the id column in table_2. This call successfully joins the tables into one; however, the resulting joined_table has duplicate fields for the matching columns.
My two questions are:
How can I leverage AWS Glue job with Pyspark to join all columns that match across the two tables so that there are not duplicate columns and while adding the new fields?
This sample call only takes in the 'id' column because I was just trying to get it to work; however, I want to pass in all the columns that match across the two tables. How can I pass a list of columns to this Join.apply call? I am aware of the methods available in PySpark directly, but I am wondering whether there is a way specific to AWS Glue jobs, or whether there is something I need to do within AWS Glue to leverage PySpark functionality directly.
I found that I needed to rename the matching columns in table_1, and that I was missing a call to .drop_fields after my Join.apply call to remove the old columns from the joined table.
Additionally, you can pass in a list of column names rather than the single 'id' column that I was trying to use in the question.
joineddata = Join.apply(frame1 = table1, frame2 = table2, keys1 = ['id'], keys2 = ['id'], transformation_ctx = 'joinedData')
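For reference, a minimal sketch of that rename-then-drop approach (the right_id name is purely illustrative, and renaming on either side of the join works the same way):
from awsglue.transforms import Join

# Rename the key on one side so the joined frame has no clashing field names,
# then drop the renamed copy after the join.
renamed_table_2 = table_2.rename_field('id', 'right_id')
joined_data = Join.apply(frame1=table_1, frame2=renamed_table_2,
                         keys1=['id'], keys2=['right_id'],
                         transformation_ctx='joinedData')
joined_data = joined_data.drop_fields(['right_id'])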
The join in AWS Glue doesn't handle duplicates. You need to convert to DataFrames and then drop the duplicates.
If you have duplicates, try this:
selectedFieldsDataFrame = joineddata.toDF()
selectedFieldsDataFrame = selectedFieldsDataFrame.dropDuplicates()

How can I check the partition list from Athena in AWS?

I want to check the partition lists in Athena.
I used a query like this:
show partitions table_name
But I want to check whether a specific partition exists.
So I used the query below, but no results were returned.
show partitions table_name partition(dt='2010-03-03')
That's because dt also contains the hour:
dt='2010-03-03-01', dt='2010-03-03-02', ...
So is there any way to search so that when I input '2010-03-03' it matches '2010-03-03-01', '2010-03-03-02', and so on?
Do I have to split the partition into two keys like this?
dt='2010-03-03', dh='01'
Also, show partitions table_name returns only 500 rows in Hive. Is it the same in Athena?
In Athena v2:
Use this SQL:
SELECT dt
FROM db_name."table_name$partitions"
WHERE dt LIKE '2010-03-03-%'
(see the official AWS docs)
In Athena v1:
There is a way to return the partition list as a resultset, so this can be filtered using LIKE. But you need to use the internal information_schema database like this:
SELECT partition_value
FROM information_schema.__internal_partitions__
WHERE table_schema = '<DB_NAME>'
AND table_name = '<TABLE_NAME>'
AND partition_value LIKE '2010-03-03-%'
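If you need to do this programmatically rather than from the Athena console, a minimal boto3 sketch for the v2 query (the database name, table name, and S3 output location are placeholders):
import time
import boto3

athena = boto3.client('athena')

query = """
    SELECT dt
    FROM "db_name"."table_name$partitions"
    WHERE dt LIKE '2010-03-03-%'
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={'Database': 'db_name'},
    ResultConfiguration={'OutputLocation': 's3://your-athena-results-bucket/'})
query_id = execution['QueryExecutionId']

# Wait for the query to finish, then print the matching partition values.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

if state == 'SUCCEEDED':
    rows = athena.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows']
    for row in rows[1:]:  # the first row is the column header
        print(row['Data'][0]['VarCharValue'])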

SyncFramework: How to sync all columns from a table?

I created a program to sync tables between two databases.
I use this common code:
DbSyncScopeDescription myScope = new DbSyncScopeDescription("myscope");
DbSyncTableDescription tblDesc = SqlSyncDescriptionBuilder.GetDescriptionForTable("Table", onPremiseConn);
myScope.Tables.Add(tblDesc);
My program creates the tracking table with only the primary key (the id column).
The sync works fine for deleting and inserting rows.
But updates don't work. I need all the columns to be updated, and they are not (for example, a telephone column).
I read that I need to add the columns I want to sync MANUALLY with this code:
Collection<string> includeColumns = new Collection<string>();
includeColumns.Add("telephone");
// ... one Add call per remaining column
And change the table description this way:
DbSyncTableDescription tblDesc = SqlSyncDescriptionBuilder.GetDescriptionForTable("Table", includeColumns, onPremiseConn);
Is there a way to add all the columns of the table automatically?
Something like:
Collection<string> includeColumns = GetAllColumns("Table");
Thanks,
SqlSyncDescriptionBuilder.GetDescriptionForTable("Table", onPremiseConn) already includes all the columns of the table.
The tracking table only stores the PK, any filter columns, and some Sync Framework-specific columns.
Tracking is at the row level, not the column level.
During sync, the tracking table and its base table are joined to get the rows to be synced.