I am using the BigQuery Python API to create a table, and I would like to set an expiration date on the table so that it is automatically dropped after a certain number of days.
Here is my code:
from datetime import datetime, timedelta
from google.cloud import bigquery as bq

client = bq.Client()
job_config = bq.QueryJobConfig()
dataset_id = dataset
table_ref = client.dataset(dataset_id).table(filename)
job_config.destination = table_ref
job_config.write_disposition = 'WRITE_TRUNCATE'

# Expiration three days from now, expressed as a Unix timestamp
dt = datetime.now() + timedelta(seconds=259200)
unixtime = (dt - datetime(1970, 1, 1)).total_seconds()
expiration_time = unixtime
job_config.expires = expiration_time

query_job = client.query(query, job_config=job_config)
query_job.result()
The problem is that the expiration parameter doesn't seem to work: when I check the table details in the UI, the expiration date still shows Never.
To answer a slightly different question: instead of specifying the expiration as part of the request options, you can use a CREATE TABLE statement, where the relevant option is expiration_timestamp. For example:
CREATE OR REPLACE TABLE my_dataset.MyTable
(
x INT64,
y FLOAT64
)
OPTIONS (
expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 3 DAY)
);
This creates a table with two columns that will expire three days from now. CREATE TABLE supports an optional AS SELECT clause, too, if you want to create the table from the result of a query (the documentation goes into more detail).
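If you would rather stay in Python, the same DDL can be submitted through the BigQuery client library. Below is a minimal sketch, assuming placeholder dataset, table, and source-table names and combining the expiration option with AS SELECT:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder names; the OPTIONS clause sets the expiration when the table is created.
ddl = """
    CREATE OR REPLACE TABLE my_dataset.MyTable
    OPTIONS (
      expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 3 DAY)
    )
    AS SELECT x, y FROM my_dataset.SourceTable
"""

client.query(ddl).result()  # DDL statements return an empty result set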
To update an existing table expiration time with Python:
import datetime
from google.cloud import bigquery
client = bigquery.Client()
table = client.get_table("project.dataset.table")
table.expires = datetime.datetime.now() + datetime.timedelta(days=1)
client.update_table(table, ['expires'])
Credits: /u/ApproximateIdentity
Looking at the docs for the query method, we can see that it's not possible to set an expiration time in the query job config.
The proper way to do this is to set it on the Table resource, something like:
from datetime import datetime, timedelta
from google.cloud import bigquery as bq

client = bq.Client()
job_config = bq.QueryJobConfig()
dataset_id = dataset
table_ref = client.dataset(dataset_id).table(filename)

# Create the destination table first, with the expiration set on the Table resource
table = bq.Table(table_ref)
dt = datetime.now() + timedelta(seconds=259200)  # three days from now
table.expires = dt
client.create_table(table)

# Point the query at the pre-created table; the default write disposition
# (WRITE_EMPTY) succeeds because the newly created table is empty.
job_config.destination = table_ref
query_job = client.query(query, job_config=job_config)
query_job.result()
Related
I wrote a query for one of my BigQuery tables, called historical, and I would like to copy the result of this query into a new BigQuery table called historical_recent. I'm having difficulty figuring out how to do this with Python. Right now, I am able to execute my query and get the expected result:
SELECT * FROM gcp-sandbox.dailydev.historical WHERE (date BETWEEN '2015-11-05 00:00:00' AND '2015-11-07 23:00:00')
I am also able to copy my BigQuery table without making any changes with this script:
from google.cloud import bigquery
client = bigquery.Client()
job = client.copy_table(
'gcp-sandbox.dailydev.historical',
'gcp-sandbox.dailydev.historical_copy')
How can I combine both using Python?
You can use an INSERT statement, as in the example below:
INSERT `gcp-sandbox.dailydev.historical_recent`
SELECT *
FROM `gcp-sandbox.dailydev.historical`
WHERE date BETWEEN '2015-11-05 00:00:00' AND '2015-11-07 23:00:00'
Using Python to save your query result.
from google.cloud import bigquery

client = bigquery.Client()

# Target table to save results
table_id = "gcp-sandbox.dailydev.historical_recent"

job_config = bigquery.QueryJobConfig(
    allow_large_results=True,
    destination=table_id,
    use_legacy_sql=False,
)

# Standard SQL: the project name contains a hyphen, so the table reference must be quoted with backticks
sql = """
    SELECT * FROM `gcp-sandbox.dailydev.historical`
    WHERE (date BETWEEN '2015-11-05 00:00:00' AND '2015-11-07 23:00:00')
"""

query = client.query(sql, job_config=job_config)
query.result()
print("Query results loaded to the table {}".format(table_id))
This example is based on the Google documentation.
I have a self-authored Glue script and a JDBC connection stored in the Glue catalog. I cannot figure out how to use PySpark to run a select statement against the MySQL database in RDS that my JDBC connection points to. I have also used a Glue crawler to infer the schema of the RDS table that I am interested in querying. How do I query the RDS database using a WHERE clause?
I have looked through the documentation for DynamicFrameReader and the GlueContext class, but neither seems to point me in the direction I am seeking.
It depends on what you want to do. For example, if you want to do a select * from table where <conditions>, there are two options:
Assuming you created a crawler and read the source into your AWS Glue job like this:
# Read data from database
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "db", table_name = "students", redshift_tmp_dir = args["TempDir"])
AWS Glue
# Select the needed fields
selectfields1 = SelectFields.apply(frame = datasource0, paths = ["user_id", "full_name", "is_active", "org_id", "org_name", "institution_id", "department_id"], transformation_ctx = "selectfields1")
filter2 = Filter.apply(frame = selectfields1, f = lambda x: x["org_id"] in org_ids, transformation_ctx="filter2")
PySpark + AWS Glue
from awsglue.dynamicframe import DynamicFrame

# Change DynamicFrame to Spark DataFrame
dataframe = DynamicFrame.toDF(datasource0)
# Create a view
dataframe.createOrReplaceTempView("students")
# Use SparkSQL to select the fields from the "students" view registered above
dataframe_sql_df_dim = spark.sql("SELECT user_id, full_name, is_active, org_id, org_name, institution_id, department_id FROM students WHERE org_id in (" + org_ids + ")")
# Change back to DynamicFrame
selectfields = DynamicFrame.fromDF(dataframe_sql_df_dim, glueContext, "selectfields2")
I have an Athena table partitioned by date, like this:
20190218
I want to delete all the partitions that were created last year.
I tried the queries below, but they didn't work.
ALTER TABLE tblname DROP PARTITION (partition1 < '20181231');
ALTER TABLE tblname DROP PARTITION (partition1 > '20181010'), Partition (partition1 < '20181231');
According to https://docs.aws.amazon.com/athena/latest/ug/alter-table-drop-partition.html, ALTER TABLE tblname DROP PARTITION takes a partition spec, so no ranges are allowed.
In Presto you would do DELETE FROM tblname WHERE ..., but DELETE is not supported by Athena either.
For these reasons, you need to leverage some external solution. For example (a rough boto3 sketch of these steps follows the list):
list the files as in https://stackoverflow.com/a/48824373/65458
delete the files and containing directories
update partitions information (https://docs.aws.amazon.com/athena/latest/ug/msck-repair-table.html should be helpful)
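A rough sketch of these steps with boto3, assuming the table's data lives under s3://my-data-bucket/tblname/partition1=YYYYMMDD/ (the bucket, prefix, table name, and output location are placeholders):

import boto3

s3 = boto3.client("s3")
athena = boto3.client("athena")

BUCKET = "my-data-bucket"            # placeholder
PREFIX = "tblname/partition1="       # placeholder; layout assumed to be .../partition1=YYYYMMDD/
CUTOFF = "20181231"

# 1) List the files under the table prefix and pick those belonging to old partitions
paginator = s3.get_paginator("list_objects_v2")
to_delete = []
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        # Key looks like: tblname/partition1=20181105/file.parquet (assumed layout)
        partition_value = obj["Key"].split("partition1=")[1].split("/")[0]
        if partition_value < CUTOFF:
            to_delete.append({"Key": obj["Key"]})

# 2) Delete the files (delete_objects accepts at most 1000 keys per call)
for i in range(0, len(to_delete), 1000):
    s3.delete_objects(Bucket=BUCKET, Delete={"Objects": to_delete[i:i + 1000]})

# 3) Update the partition information, as in step 3 above
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE tblname",
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder
)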
While the Athena SQL may not support it at this time, the Glue API call GetPartitions (that Athena uses under the hood for queries) supports complex filter expressions similar to what you can write in a SQL WHERE expression.
Instead of deleting partitions through Athena you can do GetPartitions followed by BatchDeletePartition using the Glue API.
This is a script that does what Theo recommended:
import json
import logging

import awswrangler as wr
import boto3
from botocore.exceptions import ClientError

logging.basicConfig(level=logging.INFO, format=logging.BASIC_FORMAT)
logger = logging.getLogger()


def delete_partitions(database_name: str, table_name: str):
    client = boto3.client('glue')
    paginator = client.get_paginator('get_partitions')
    page_count = 0
    partition_count = 0
    for page in paginator.paginate(DatabaseName=database_name, TableName=table_name, MaxResults=20):
        page_count = page_count + 1
        partitions = page['Partitions']
        partitions_to_delete = []
        for partition in partitions:
            partition_count = partition_count + 1
            partitions_to_delete.append({'Values': partition['Values']})
            logger.info(f"Found partition {partition['Values']}")
        if partitions_to_delete:
            response = client.batch_delete_partition(DatabaseName=database_name, TableName=table_name,
                                                     PartitionsToDelete=partitions_to_delete)
            logger.info(f'Deleted partitions with response: {response}')
        else:
            logger.info('Done with all partitions')


def repair_table(database_name: str, table_name: str):
    client = boto3.client('athena')
    try:
        response = client.start_query_execution(QueryString='MSCK REPAIR TABLE ' + table_name + ';',
                                                 QueryExecutionContext={'Database': database_name}, )
    except ClientError as err:
        logger.info(err.response['Error']['Message'])
    else:
        res = wr.athena.wait_query(query_execution_id=response['QueryExecutionId'])
        logger.info(f"Query succeeded: {json.dumps(res, indent=2)}")


if __name__ == '__main__':
    table = 'table_name'
    database = 'database_name'
    delete_partitions(database_name=database, table_name=table)
    repair_table(database_name=database, table_name=table)
Posting the Glue API workaround for Java to save some time for those who need it:
public void deleteMetadataTablePartition(String catalog,
String db,
String table,
String expression) {
GetPartitionsRequest getPartitionsRequest = new GetPartitionsRequest()
.withCatalogId(catalog)
.withDatabaseName(db)
.withTableName(table)
.withExpression(expression);
List<PartitionValueList> partitionsToDelete = new ArrayList<>();
do {
GetPartitionsResult getPartitionsResult = this.glue.getPartitions(getPartitionsRequest);
List<PartitionValueList> partitionsValues = getPartitionsResult.getPartitions()
.parallelStream()
.map(p -> new PartitionValueList().withValues(p.getValues()))
.collect(Collectors.toList());
partitionsToDelete.addAll(partitionsValues);
getPartitionsRequest.setNextToken(getPartitionsResult.getNextToken());
} while (getPartitionsRequest.getNextToken() != null);
Lists.partition(partitionsToDelete, 25)
.parallelStream()
.forEach(partitionValueList -> {
glue.batchDeletePartition(
new BatchDeletePartitionRequest()
.withCatalogId(catalog)
.withDatabaseName(db)
.withTableName(table)
.withPartitionsToDelete(partitionValueList));
});
}
I have an S3 bucket which is constantly being filled with new data, and I am using Athena and Glue to query that data. The problem is that if Glue doesn't know a new partition has been created, it doesn't search it. Making an API call to run the Glue crawler each time I need a new partition is too expensive, so the best solution is to tell Glue that a new partition has been added, i.e. to create the new partition in its catalog table. I looked through the AWS documentation but had no luck. I am using Java with AWS. Any help?
You may want to use the batch_create_partition() Glue API to register new partitions. It doesn't require any expensive operation like MSCK REPAIR TABLE or re-crawling.
I had a similar use case, for which I wrote a Python script that does the following:
Step 1 - Fetch the table information and parse out the pieces needed to register the partitions.
# Fetching table information from the Glue catalog
# (l_client is a boto3 Glue client; l_catalog_id, l_database and l_table are set earlier in the script)
logger.info("Fetching table info for {}.{}".format(l_database, l_table))
try:
    response = l_client.get_table(
        CatalogId=l_catalog_id,
        DatabaseName=l_database,
        Name=l_table
    )
except Exception as error:
    logger.error("Exception while fetching table info for {}.{} - {}"
                 .format(l_database, l_table, error))
    sys.exit(-1)

# Parsing the table info required to create partitions
input_format = response['Table']['StorageDescriptor']['InputFormat']
output_format = response['Table']['StorageDescriptor']['OutputFormat']
table_location = response['Table']['StorageDescriptor']['Location']
serde_info = response['Table']['StorageDescriptor']['SerdeInfo']
partition_keys = response['Table']['PartitionKeys']
Step 2 - Generate a list of dictionaries, where each dictionary contains the information needed to create a single partition. All dictionaries have the same structure, but their partition-specific values (year, month, day, hour) change.
def generate_partition_input_list(start_date, num_of_days, table_location,
                                  input_format, output_format, serde_info):
    input_list = []  # Initializing empty list
    today = datetime.utcnow().date()
    if start_date > today:  # To handle scenarios if any future partitions are created manually
        start_date = today
    end_date = today + timedelta(days=num_of_days)  # Getting end date till which partitions need to be created
    logger.info("Partitions to be created from {} to {}".format(start_date, end_date))

    # date_range is a small helper (defined elsewhere in the script) that yields each date between start_date and end_date
    for input_date in date_range(start_date, end_date):
        # Formatting partition values by padding required zeroes and converting into string
        year = str(input_date)[0:4].zfill(4)
        month = str(input_date)[5:7].zfill(2)
        day = str(input_date)[8:10].zfill(2)
        for hour in range(24):  # Looping over 24 hours to generate partition input for each hour of the day
            hour = str('{:02d}'.format(hour))  # Padding zero to make sure that hour is in two digits
            part_location = "{}{}/{}/{}/{}/".format(table_location, year, month, day, hour)
            input_dict = {
                'Values': [
                    year, month, day, hour
                ],
                'StorageDescriptor': {
                    'Location': part_location,
                    'InputFormat': input_format,
                    'OutputFormat': output_format,
                    'SerdeInfo': serde_info
                }
            }
            input_list.append(input_dict.copy())
    return input_list
Step 3 - Call the batch_create_partition() API
# break_list_into_chunks splits the partition list into batches of at most 100 (see the helper sketch below)
for each_input in break_list_into_chunks(partition_input_list, 100):
    create_partition_response = client.batch_create_partition(
        CatalogId=catalog_id,
        DatabaseName=l_database,
        TableName=l_table,
        PartitionInputList=each_input
    )
There is a limit of 100 partitions in a single API call, so if you are creating more than 100 partitions you will need to break your list into chunks and iterate over it.
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.batch_create_partition
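The break_list_into_chunks helper referenced in Step 3 isn't shown in the snippets above; a minimal version (the name simply matches the call above) could look like this:

def break_list_into_chunks(items, chunk_size):
    """Yield successive slices of at most chunk_size elements from items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]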
You can configure your Glue crawler to be triggered every 5 minutes.
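If you go this route, the schedule can be set directly on the crawler. A minimal boto3 sketch, assuming a crawler named my-crawler already exists (the name is a placeholder):

import boto3

glue = boto3.client("glue")

# "my-crawler" is a placeholder; the cron expression runs the crawler every 5 minutes.
glue.update_crawler(
    Name="my-crawler",
    Schedule="cron(0/5 * * * ? *)",
)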
You can create a Lambda function which will either run on a schedule or be triggered by an event from your bucket (e.g. a putObject event), and that function could call Athena to discover partitions:
import boto3

athena = boto3.client('athena')

def lambda_handler(event, context):
    athena.start_query_execution(
        QueryString="MSCK REPAIR TABLE mytable",
        ResultConfiguration={
            'OutputLocation': "s3://some-bucket/_athena_results"
        }
    )
Use Athena to add partitions manually. You can also run SQL queries via the API, as in my Lambda example.
Example from the Athena manual:
ALTER TABLE orders ADD
PARTITION (dt = '2016-05-14', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_14_May_2016'
PARTITION (dt = '2016-05-15', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_15_May_2016';
This question is old but I wanted to put it out there that someone could have s3:ObjectCreated:Put notifications trigger a Lambda function which registers new partitions when data arrives on S3. I would even expand this function to handle deprecations based on object deletes and so on. Here's a blog post by AWS which details S3 event notifications: https://aws.amazon.com/blogs/aws/s3-event-notification/
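As a rough sketch of that idea (not the code from the blog post): a Lambda subscribed to the bucket's put notifications could derive the partition value from the object key and register it through Athena. The table name, key layout, and output location below are assumptions:

import boto3

athena = boto3.client("athena")

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]  # e.g. "data/dt=20190218/file.parquet" (assumed layout)
        dt_value = key.split("dt=")[1].split("/")[0]

        # Register the partition if it is not already in the catalog
        athena.start_query_execution(
            QueryString=(
                "ALTER TABLE mytable ADD IF NOT EXISTS "
                f"PARTITION (dt = '{dt_value}') "
                f"LOCATION 's3://{bucket}/data/dt={dt_value}/'"
            ),
            ResultConfiguration={"OutputLocation": "s3://some-bucket/_athena_results"},
        )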
AWS Glue recently added a RecrawlPolicy that only crawls the new folders/partitions that you add to your S3 bucket.
https://docs.aws.amazon.com/glue/latest/dg/incremental-crawls.html
This should help you minimize crawling all the data again and again. From what I read, you can define incremental crawls while setting up your crawler or when editing an existing one. One thing to note, however, is that incremental crawls require the schema of the new data to be more or less the same as the existing schema.
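If you manage the crawler programmatically, the corresponding setting is the RecrawlPolicy parameter. A minimal boto3 sketch, assuming an existing crawler named my-crawler (the name is a placeholder):

import boto3

glue = boto3.client("glue")

# "my-crawler" is a placeholder; CRAWL_NEW_FOLDERS_ONLY makes subsequent runs incremental.
glue.update_crawler(
    Name="my-crawler",
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
)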
How can I use the current datetime as a parameter in a Power BI query statement for a REST API? I specified two parameters, "DateStart" and "DateEnd", which I want to include in my data source's SQL statement. I want to set DateEnd to the current system datetime and DateStart to the datetime 15 minutes earlier.
My query is:
let
body = Text.ToBinary("{
""Type"": ""Feedbacks"",
""FromDate"": ""01-01-2015 23:00:00.000"",
""ToDate"": ""20-09-2017 23:00:00.000"",
""SearchField"": ""test"",
""SearchFieldValue"": ""*""
}"),
actualUrl = "http://xx.xx.xx.xx:xx/service/GetSomething",
options = [
Headers =[#"Content-type"="application/json"],
Content=body
],
result = Web.Contents(actualUrl, options),
#"Imported JSON" = Json.Document(result,65001)
in
#"Imported JSON"
Here I want to set the end date to the current datetime and the start date to the current datetime minus 15 minutes.
Thanks in advance.