How can I check the partition list from Athena in AWS?

I want to check the partition lists in Athena.
I used a query like this:
show partitions table_name
But I want to check whether a specific partition exists.
So I used the query below, but no results were returned:
show partitions table_name partition(dt='2010-03-03')
That's because dt also contains the hour:
dt='2010-03-03-01', dt='2010-03-03-02', ...
So is there any way to search so that when I input '2010-03-03' it finds '2010-03-03-01', '2010-03-03-02', and so on?
Or do I have to split the partition like this?
dt='2010-03-03', dh='01'
Also, show partitions table_name returns only 500 rows in Hive. Is it the same in Athena?

In Athena v2:
Use this SQL:
SELECT dt
FROM db_name."table_name$partitions"
WHERE dt LIKE '2010-03-03-%'
(see the official aws docs)
In Athena v1:
There is a way to return the partition list as a result set, which can then be filtered using LIKE. You need to query the internal information_schema database like this:
SELECT partition_value
FROM information_schema.__internal_partitions__
WHERE table_schema = '<DB_NAME>'
AND table_name = '<TABLE_NAME>'
AND partition_value LIKE '2010-03-03-%'
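If you need to run this query programmatically and page through every matching partition (so any display limit on rows is not a concern), a minimal sketch with boto3 might look like this; the database, table, and result-bucket names are placeholders:

import time
import boto3

athena = boto3.client('athena')

# Placeholders: replace db_name, table_name and the results bucket with your own
sql = """
SELECT dt
FROM db_name."table_name$partitions"
WHERE dt LIKE '2010-03-03-%'
"""

qid = athena.start_query_execution(
    QueryString=sql,
    QueryExecutionContext={'Database': 'db_name'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-query-results/'},
)['QueryExecutionId']

# Wait for the query to finish
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

# Page through all result rows; the first row of the first page is the column header
paginator = athena.get_paginator('get_query_results')
for page in paginator.paginate(QueryExecutionId=qid):
    for row in page['ResultSet']['Rows']:
        print(row['Data'][0].get('VarCharValue'))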

Related

BQ Get labels from information schema

I need to get the labels of all the BQ tables in a project.
Currently the only way I found is to loop over all the tables and retrieve the labels.
from google.cloud import bigquery

client = bigquery.Client()
tables = client.list_tables(dataset_id)
for table in tables:
    if table.labels:
        for label, value in table.labels.items():
            print(table.table_id, label, value)
This approach works but is time consuming.
Is there any possibility to get the labels using a unique BQ query?
INFORMATION_SCHEMA.TABLES doesn't return the labels.
You can get the labels from the INFORMATION_SCHEMA option views. For dataset-level labels:
SELECT *
FROM INFORMATION_SCHEMA.SCHEMATA_OPTIONS
WHERE schema_name = 'schema'
  AND option_name = 'labels';
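If you need the labels per table rather than per dataset, a similar query against INFORMATION_SCHEMA.TABLE_OPTIONS may work. Here is a sketch using the Python client; the dataset name is a placeholder, and the assumption is that table labels show up there under option_name = 'labels':

from google.cloud import bigquery

client = bigquery.Client()

# Assumption: table-level labels are exposed in INFORMATION_SCHEMA.TABLE_OPTIONS
# with option_name = 'labels'; 'my_dataset' is a placeholder.
query = """
SELECT table_name, option_value
FROM `my_dataset`.INFORMATION_SCHEMA.TABLE_OPTIONS
WHERE option_name = 'labels'
"""
for row in client.query(query).result():
    print(row.table_name, row.option_value)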

Filter Dynamo DB rows

I want to filter all DynamoDB rows where 2 columns have the same value.
table = client.Table('XXX')
response = table.query(
KeyConditionExpression=Key('column1').eq(KeyConditionExpression=Key('column2'))
)
This is wrong, as we can't pass a KeyConditionExpression inside the eq statement. I don't want to scan through all the rows and then filter them.
I have looked through multiple resources and answers, but every resource talks about checking multiple columns against some value, not a condition involving two columns.
Is there any way we can achieve this?
Yes, this is possible.
If you want to query over all records you need to use scan; if you want to query only the records with one specific partition key you can use query.
For both you can use a FilterExpression, which filters the records after they are retrieved from the database but before they are returned to you (so beware: a scan with a filter still reads all your records).
A scan from the CLI could look like this:
aws dynamodb scan \
--table-name your-table \
--filter-expression "#i = #j" \
--expression-attribute-names '{"#i": "column1", "#j": "column2"}'
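The same scan from Python with boto3 would look roughly like this (a sketch; the table and column names come from the question, and for a large table you still need to follow LastEvaluatedKey to get every page):

import boto3

table = boto3.resource('dynamodb').Table('XXX')

# '#c1 = #c2' compares two attributes of the same item;
# ExpressionAttributeNames maps the placeholders to the real column names.
kwargs = {
    'FilterExpression': '#c1 = #c2',
    'ExpressionAttributeNames': {'#c1': 'column1', '#c2': 'column2'},
}
response = table.scan(**kwargs)
items = response['Items']

# A single scan call reads at most 1 MB, so keep paginating
while 'LastEvaluatedKey' in response:
    response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'], **kwargs)
    items.extend(response['Items'])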
Create a Global Secondary Index with a partition key of 'Column1Value#Column2Value'
Then it's simply a matter of querying the GSI.

AWS Glue dynamic frame - no column headers if no data

I read the Glue catalog table, convert it to a DataFrame, and print the schema using the code below (Spark with Python):
dyf = glueContext.create_dynamic_frame.from_catalog(database='database_name',
                                                    table_name='table_name',
                                                    redshift_tmp_dir=args['TempDir'])
df = dyf.toDF()
df.printSchema()
It works fine when the table has data.
But it doesn't print the schema if the table is empty (it is unable to get the schema of an empty table), and as a result the subsequent joins fail.
Is there a way to make the dynamic frame get the table schema from the catalog even for an empty table, or is there any other alternative?
I found a solution. It is not ideal, but it works: if you call apply_mapping() on your DynamicFrame, it will preserve the schema in the DataFrame. For example, if your table has a column last_name, you can do:
dyf = glueContext.create_dynamic_frame.from_catalog(database='database_name',
                                                    table_name='table_name')
df = dyf.apply_mapping([
    ("last_name", "string", "last_name", "string")
]).toDF()
df.printSchema()

AWS Glue Dynamic_frame with pushdown predicate not filtering correctly

I am writing a script for AWS Glue that reads from Parquet files stored in S3, in which I am creating a DynamicFrame and attempting to use push-down predicate logic to restrict the data coming in.
The table partitions are (in order): account_id > region > vpc_id > dt
And the code for creating the dynamic_frame is the following:
dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database=DATABASE_NAME,
    table_name=TABLE_NAME,
    push_down_predicate="dt='" + DATE + "'")
where DATE = '2019-10-29'
However, it seems that Glue still attempts to read data from other days. Maybe it's because I have to specify a push_down_predicate for the other criteria as well?
As per the comments, the logs show that the date partition column is named "dt", whereas in your table it is referred to by the name "date".
Logs
s3://route/account_id=XXX/region=eu-west-1/vpc_id=YYY/dt=2019-07-15
s3://route/account_id=XXX/region=eu-west-1/vpc_id=YYY//dt=2019-10-03
s3://route/account_id=XXX/region=eu-west-1/vpc_id=YYY//dt=2019-08-27
s3://route/account_id=XXX/region=eu-west-1/vpc_id=YYY//dt=2019-10-29 ...
Your Code
dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database=DATABASE_NAME,
    table_name=TABLE_NAME,
    push_down_predicate="date='" + DATE + "'")
Change the name of the date partition column to dt in your table, and do the same in the push_down_predicate parameter in the code above.
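For reference, the corrected call would then look roughly like this (a sketch; only the partition column name in the predicate changes):

dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database=DATABASE_NAME,
    table_name=TABLE_NAME,
    # 'dt' must match the partition column name registered in the Glue catalog
    push_down_predicate="dt='" + DATE + "'")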
I also see extra forward slashes in some of the paths in the logs above. Were these partitions added manually through Athena using the ALTER TABLE command? If so, I would recommend running MSCK REPAIR TABLE to load all partitions into the table and avoid such issues. Extra blank slashes in an S3 path sometimes lead to errors while doing ETL through Spark.

Automatically update athena partition - MSCK Repair

I have an inventory bucket, and inside the bucket I have 6 folders.
In Athena I have 6 tables, one for each of the 6 folders.
Now I have to update the partitions whenever a file is dropped into any one of the 6 folders.
How do I run multiple SQL statements (6 SQL statements) in one Lambda triggered by the S3 event?
import boto3

def lambda_handler(event, context):
    bucket_name = 'some_bucket'
    client = boto3.client('athena')
    config = {
        'OutputLocation': 's3://' + bucket_name + '/',
        'EncryptionConfiguration': {'EncryptionOption': 'SSE_S3'}
    }
    # Query Execution Parameters
    sql = 'MSCK REPAIR TABLE some_database.some_table'
    context = {'Database': 'some_database'}
    client.start_query_execution(QueryString=sql,
                                 QueryExecutionContext=context,
                                 ResultConfiguration=config)
The database is the same; however, I have 6 different tables, and I have to update all 6 of them.
First, I would check the key of the dropped file and only update the table that points to the prefix where the file was dropped. E.g. if your folders and tables are prefix0, prefix1, prefix2, etc., and the dropped file has the key prefix1/some-file, you update only the table with the location prefix1. There is no need to update the other tables; their data hasn't changed.
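A minimal sketch of that routing inside the Lambda handler (the prefix and table names below are hypothetical examples):

# Hypothetical mapping from S3 prefix (folder) to the Athena table built on it
PREFIX_TO_TABLE = {
    'prefix0': 'some_database.table0',
    'prefix1': 'some_database.table1',
    # ... one entry per folder/table
}

def table_for_key(key):
    prefix = key.split('/', 1)[0]
    return PREFIX_TO_TABLE[prefix]

# Inside lambda_handler, for an S3 event record:
# key = event['Records'][0]['s3']['object']['key']
# table = table_for_key(key)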
However, I would suggest not using MSCK REPAIR TABLE for this. That command is terrible in almost every possible way. It's wildly inefficient and its performance becomes worse and worse as you add more objects to your table's prefix. It doesn't look like you wait for it to complete in your Lambda, so at least you're not paying for its inefficiency, but there are much better ways to add partitions.
You can use the Glue APIs directly (under the hood, Athena tables are tables in the Glue catalog), but that is actually a bit complicated to show, since you need to specify a lot of metadata (a downside of the Glue APIs).
I would suggest that instead of the MSCK REPAIR TABLE … call you do ALTER TABLE ADD PARTITION …:
Change the line
sql = 'MSCK REPAIR TABLE some_database.some_table'
to
sql = 'ALTER TABLE some_database.some_table ADD IF NOT EXISTS PARTITION (…) LOCATION \'s3://…\''
The parts where it says … you will have to extract from the object's key. If your keys look like s3://some-bucket/pk0=foo/pk1=bar/object.gz and your table has the partition keys pk0 and pk1, the SQL would look like this:
ALTER TABLE some_database.some_table
ADD IF NOT EXISTS
PARTITION (pk0 = 'foo', pk1 = 'bar') LOCATION 's3://some-bucket/pk0=foo/pk1=bar/'
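Putting it together, here is a sketch of how the Lambda could derive the partition values from the event key and issue that statement, assuming Hive-style pk=value path segments as in the example above (the helper name and output location are illustrative):

import boto3

athena = boto3.client('athena')

def add_partition(bucket, key, database, table, output_location):
    # 'pk0=foo/pk1=bar/object.gz' -> [('pk0', 'foo'), ('pk1', 'bar')]
    segments = key.split('/')[:-1]
    parts = [seg.split('=', 1) for seg in segments if '=' in seg]
    partition_spec = ', '.join("%s = '%s'" % (k, v) for k, v in parts)
    location = 's3://%s/%s/' % (bucket, '/'.join(segments))

    sql = ("ALTER TABLE %s.%s ADD IF NOT EXISTS PARTITION (%s) LOCATION '%s'"
           % (database, table, partition_spec, location))
    athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={'Database': database},
        ResultConfiguration={'OutputLocation': output_location},
    )

You would call this from lambda_handler with the bucket and key taken from the S3 event record, and the table picked by the prefix routing shown earlier.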