Parse schema of a dynamic dataframe in AWS Glue

I have a dynamic dataframe in AWS Glue which I created using the piece of code below.
val rawDynamicDataFrame = glueContext.getCatalogSource(
  database = rawDBName,
  tableName = rawTableName,
  redshiftTmpDir = "",
  transformationContext = "rawDynamicDataFrame"
).getDynamicFrame()
In order to get the schema of the above dynamic frame, I used the following:
val x = rawDynamicDataFrame.schema
Now x is of type com.amazonaws.services.glue.schema.Schema. How can I parse the schema object?

To check if a field exists in the schema, use containsField(fieldPath):
if (rawDynamicDataFrame.schema.containsField("app_name")) {
  // do something
}

Maybe you can use field_names = [field.name for field in rawDynamicDataFrame.schema().fields] to get a list of field names.
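If you are working from PySpark rather than Scala, a minimal sketch (assuming dyf is your DynamicFrame) is to convert it to a Spark DataFrame and walk the resulting StructType, which exposes each field's name and type:
# Hedged sketch: dyf is assumed to be an existing Glue DynamicFrame
df = dyf.toDF()
for field in df.schema.fields:
    print(field.name, field.dataType)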

Related

AWS Glue transform string value from postgres to json array

I am new to AWS Glue and PySpark. I have a table in RDS which contains a varchar field id. I want to map id to a String field in the output JSON which is inside a JSON array field (let's say newId):
{
  "sources" : [
    { "newId" : "1234asdf" }
  ]
}
How can I achieve this using the transforms defined in the PySpark script of the AWS Glue job?
Use the AWS Glue Map transformation to map the string field into a field inside a JSON array in the target:
NewFrame = Map.apply(frame=OldFrame, f=map_fields)
and define a function map_fields like this:
def map_fields(rec):
    rec["sources"] = [{"newId": rec["id"]}]
    del rec["id"]
    return rec
Make sure to delete the original field, as done with del rec["id"], otherwise the logic doesn't work.
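As a quick sanity check, map_fields can be exercised with plain Python (no Glue context needed); the sample record below is hypothetical:
rec = {"id": "1234asdf", "title": "example"}
print(map_fields(rec))
# {'title': 'example', 'sources': [{'newId': '1234asdf'}]}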

AWS Glue dynamic frame - no column headers if no data

I read the Glue catalog table, convert it to a dataframe, and print the schema using the code below (Spark with Python):
dyf = glueContext.create_dynamic_frame.from_catalog(database='database_name',
                                                    table_name='table_name',
                                                    redshift_tmp_dir=args['TempDir'])
df = dyf.toDF()
df.printSchema()
It works fine when the table has data, but it doesn't print the schema if the table is empty (it is unable to get the schema of an empty table). As a result, subsequent joins fail.
Is there a way to make the dynamic frame get the table schema from the catalog even for an empty table, or is there any other alternative?
I found a solution. It is not ideal, but it works. If you call apply_mapping() on your DynamicFrame, it will preserve the schema in the DataFrame. For example, if your table has a column last_name, you can do:
dyf = glueContext.create_dynamic_frame.from_catalog(database='database_name',
                                                    table_name='table_name',
                                                    redshift_tmp_dir=args['TempDir'])
df = dyf.apply_mapping([
    ("last_name", "string", "last_name", "string")
]).toDF()
df.printSchema()
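If typing out every column is impractical, a hedged sketch of an alternative is to build the mapping tuples from the Data Catalog itself via boto3 (this assumes the catalog column type names are ones apply_mapping accepts, which holds for common types like string and int; partition keys would need similar handling from the table's PartitionKeys):
import boto3

glue = boto3.client('glue')
table = glue.get_table(DatabaseName='database_name', Name='table_name')['Table']
# build (source, sourceType, target, targetType) tuples for every catalog column
mapping = [(col['Name'], col['Type'], col['Name'], col['Type'])
           for col in table['StorageDescriptor']['Columns']]
df = dyf.apply_mapping(mapping).toDF()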

AWS Athena: Delete partitions between date range

I have an Athena table with partitions based on date, like this:
20190218
I want to delete all the partitions that were created last year.
I tried the queries below, but they didn't work:
ALTER TABLE tblname DROP PARTITION (partition1 < '20181231');
ALTER TABLE tblname DROP PARTITION (partition1 > '20181010'), Partition (partition1 < '20181231');
According to https://docs.aws.amazon.com/athena/latest/ug/alter-table-drop-partition.html, ALTER TABLE tblname DROP PARTITION takes a partition spec, so no ranges are allowed.
In Presto you would do DELETE FROM tblname WHERE ..., but DELETE is not supported by Athena either.
For these reasons, you need to leverage some external solution. For example:
1. list the files as in https://stackoverflow.com/a/48824373/65458
2. delete the files and the containing directories
3. update the partitions information (https://docs.aws.amazon.com/athena/latest/ug/msck-repair-table.html should be helpful)
While the Athena SQL may not support it at this time, the Glue API call GetPartitions (that Athena uses under the hood for queries) supports complex filter expressions similar to what you can write in a SQL WHERE expression.
Instead of deleting partitions through Athena you can do GetPartitions followed by BatchDeletePartition using the Glue API.
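A minimal sketch of that approach in Python with boto3, assuming the question's partition1 key (adjust the expression to your partition schema; pagination is omitted here, and the full script below handles it):
import boto3

glue = boto3.client('glue')
resp = glue.get_partitions(DatabaseName='database_name',
                           TableName='table_name',
                           Expression="partition1 < '20181231'")
to_delete = [{'Values': p['Values']} for p in resp['Partitions']]
# BatchDeletePartition accepts at most 25 partitions per request
for i in range(0, len(to_delete), 25):
    glue.batch_delete_partition(DatabaseName='database_name',
                                TableName='table_name',
                                PartitionsToDelete=to_delete[i:i + 25])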
This is a script that does what Theo recommended. Note that as written it deletes every partition in the table; to restrict it to a range, pass an Expression filter to get_partitions as in the sketch above and the Java example below.
import json
import logging

import awswrangler as wr
import boto3
from botocore.exceptions import ClientError

logging.basicConfig(level=logging.INFO, format=logging.BASIC_FORMAT)
logger = logging.getLogger()


def delete_partitions(database_name: str, table_name: str):
    client = boto3.client('glue')
    paginator = client.get_paginator('get_partitions')
    page_count = 0
    partition_count = 0
    # MaxResults=20 keeps each page under the 25-partition limit of batch_delete_partition
    for page in paginator.paginate(DatabaseName=database_name, TableName=table_name, MaxResults=20):
        page_count = page_count + 1
        partitions = page['Partitions']
        partitions_to_delete = []
        for partition in partitions:
            partition_count = partition_count + 1
            partitions_to_delete.append({'Values': partition['Values']})
            logger.info(f"Found partition {partition['Values']}")
        if partitions_to_delete:
            response = client.batch_delete_partition(DatabaseName=database_name, TableName=table_name,
                                                     PartitionsToDelete=partitions_to_delete)
            logger.info(f'Deleted partitions with response: {response}')
        else:
            logger.info('Done with all partitions')


def repair_table(database_name: str, table_name: str):
    client = boto3.client('athena')
    try:
        response = client.start_query_execution(QueryString='MSCK REPAIR TABLE ' + table_name + ';',
                                                QueryExecutionContext={'Database': database_name})
    except ClientError as err:
        logger.info(err.response['Error']['Message'])
    else:
        res = wr.athena.wait_query(query_execution_id=response['QueryExecutionId'])
        logger.info(f"Query succeeded: {json.dumps(res, indent=2)}")


if __name__ == '__main__':
    table = 'table_name'
    database = 'database_name'
    delete_partitions(database_name=database, table_name=table)
    repair_table(database_name=database, table_name=table)
Posting the Glue API workaround for Java, to save some time for those who need it:
public void deleteMetadataTablePartition(String catalog,
                                         String db,
                                         String table,
                                         String expression) {
    GetPartitionsRequest getPartitionsRequest = new GetPartitionsRequest()
            .withCatalogId(catalog)
            .withDatabaseName(db)
            .withTableName(table)
            .withExpression(expression);
    List<PartitionValueList> partitionsToDelete = new ArrayList<>();
    do {
        GetPartitionsResult getPartitionsResult = this.glue.getPartitions(getPartitionsRequest);
        List<PartitionValueList> partitionsValues = getPartitionsResult.getPartitions()
                .parallelStream()
                .map(p -> new PartitionValueList().withValues(p.getValues()))
                .collect(Collectors.toList());
        partitionsToDelete.addAll(partitionsValues);
        getPartitionsRequest.setNextToken(getPartitionsResult.getNextToken());
    } while (getPartitionsRequest.getNextToken() != null);
    // BatchDeletePartition accepts at most 25 partitions per request
    Lists.partition(partitionsToDelete, 25)
            .parallelStream()
            .forEach(partitionValueList -> glue.batchDeletePartition(
                    new BatchDeletePartitionRequest()
                            .withCatalogId(catalog)
                            .withDatabaseName(db)
                            .withTableName(table)
                            .withPartitionsToDelete(partitionValueList)));
}

Django-restframework with pandas

I am using django-rest-framework, pandas, and django-pandas to build an API. The response contains the data of a single user, where each label represents a column name, but every column is serialized as a list of values. I want the output as a flat object with a single scalar value per column.
Can anyone help me get the data in the desired format?
My code is
views.py
from django_pandas.io import read_frame
from rest_framework.decorators import api_view
from rest_framework.response import Response

@api_view(['GET'])
def my_view(request, id):
    qs = Health.objects.filter(id=id)
    df = read_frame(qs)
    df['x-Mean'] = abs(df['Age'] - df['Age'].mean())
    df['1.96*std'] = 1.96 * df['Age'].std()
    df['Outlier'] = abs(df['Age'] - df['Age'].mean()) > 1.96 * df['Age'].std()
    df['bmi'] = df['Weight'] / (df['Height'] / 100) ** 2
    a = df.fillna(0)
    return Response(a)
This is happening because a is a pandas.DataFrame, which corresponds to a table, so during serialization it tries to represent all the data for each table column. The DataFrame does not know that you have only one value per column.
The values have to be extracted manually:
a = {column: values[0] for column, values in df.fillna(0).to_dict(orient='list').items()}
return Response(a)
For more details check http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_dict.html
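A quick illustration of the fix with a hypothetical one-row frame:
import pandas as pd

df = pd.DataFrame({'Age': [30], 'bmi': [22.5]})
print(df.to_dict(orient='list'))                                # {'Age': [30], 'bmi': [22.5]}
print({c: v[0] for c, v in df.to_dict(orient='list').items()})  # {'Age': 30, 'bmi': 22.5}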

How to replace the old tag values with new ones in the table?

I have a Event table and a Tags table:
tags = Tags.objects.filter(event_id=id).values_list('name')
gives me a list of existing values:
oldlist = {"tags":"['tag1', 'tag4', 'tag3']"}
I have a new list:
newlist = {"tags":"['tag1', 'tag2', 'tag3']"}
I have tables as:
Table: Event
id, title, content
Table: Tags
id, event_id, name
While a user is updating an event, he can also update the tags. How do I store these new tags and replace the previous ones?
tagobj = Tags.objects.filter(event_id=id)
if len(tags) > 0:
    for i in range(len(tags)):
        tagsobj = tagobj.update(name=tags[i], event_id=key)
The above code updates the table but stores only the last value. The list can contain any number of values, and I need to replace the existing values in the database with the new list. Simply put, I want to replace the old tags with the new ones. How do I do it?
The issue is with the following lines:
for i in range(len(tags)):
    tagsobj = tagobj.update(name=tags[i], event_id=key)
You are calling update on a QuerySet tagobj, which will update all tags in the QuerySet. So all your tags will have the value of the last tag.
If I understand your question correctly, it should work if you update each individual tag. Note that a single model instance has no update() method (only QuerySets do), so set the fields and call save():
i = 0
for tag_item in tagobj:
    tag_item.name = tags[i]
    tag_item.event_id = key
    tag_item.save()
    i += 1
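This pairing only works when the old and new lists have the same length. A simpler pattern, sketched here under the assumption of the Tags model and the question's tags and key variables, is to replace the event's tags wholesale:
# delete the event's old tags, then bulk-insert the new list
Tags.objects.filter(event_id=key).delete()
Tags.objects.bulk_create([Tags(event_id=key, name=name) for name in tags])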