How to use AVRO format on AWS Glue/Athena - amazon-web-services

I have a few topics in Kafka that are writing AVRO files into S3 buckets, and I would like to run some queries on those buckets using AWS Athena.
I'm trying to create a table, but the AWS Glue crawler runs and doesn't add my table (it works if I change the file type to JSON). I've also tried to create a table from the Athena console, but it doesn't show support for AVRO files.
Any idea how to make this work?

I suggest doing it manually and not via Glue. Glue only works for the most basic situations, and this falls outside that, unfortunately.
You can find the documentation on how to create an Avro table here: https://docs.aws.amazon.com/athena/latest/ug/avro.html
The caveat for Avro tables is that you need to specify both the table columns and the Avro schema. This may look weird and redundant, but it's how Athena/Presto works. It needs a schema to know how to interpret the files, and then it needs to know which of the properties in the files you want to expose as columns (and their types, which may or may not match the Avro types).
CREATE EXTERNAL TABLE avro_table (
  foo STRING,
  bar INT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal' = '
{
  "type": "record",
  "name": "example",
  "namespace": "default",
  "fields": [
    {
      "name": "foo",
      "type": ["null", "string"],
      "default": null
    },
    {
      "name": "bar",
      "type": ["null", "int"],
      "default": null
    }
  ]
}
')
STORED AS AVRO
LOCATION 's3://some-bucket/data/';
Notice how the Avro schema appears as a JSON document inside of a serde property value (single quoted) – the formatting is optional, but makes this example easier to read.
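If you'd rather not paste the DDL into the Athena console by hand, the same statement can be submitted programmatically. Below is a minimal sketch using boto3's Athena client; the region, database, result-output location, and the create_avro_table.sql file name are assumptions, not something from the question:
import boto3

# Assumed region; Athena also needs an S3 location for query results.
athena = boto3.client("athena", region_name="us-east-1")

# Hypothetical file holding the CREATE EXTERNAL TABLE statement shown above.
with open("create_avro_table.sql") as f:
    ddl = f.read()

# DDL runs like any other Athena query.
response = athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://some-bucket/athena-query-results/"},
)
print(response["QueryExecutionId"])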

Doing it manually seems to be the way to make it work.
Here is some code to generate the Athena schema directly from a literal Avro schema. It works with avro-python3 on Python 3.7. It is taken from https://github.com/dataqube-GmbH/avro2athena (I am the owner of the repo).
from avro.schema import Parse, RecordSchema, PrimitiveSchema, ArraySchema, MapSchema, EnumSchema, UnionSchema, FixedSchema


def create_athena_schema_from_avro(avro_schema_literal: str) -> str:
    avro_schema: RecordSchema = Parse(avro_schema_literal)

    column_schemas = []
    for field in avro_schema.fields:
        column_name = field.name.lower()
        column_type = create_athena_column_schema(field.type)
        column_schemas.append(f"`{column_name}` {column_type}")

    return ', '.join(column_schemas)


def create_athena_column_schema(avro_schema) -> str:
    if type(avro_schema) == PrimitiveSchema:
        return rename_type_names(avro_schema.type)

    elif type(avro_schema) == ArraySchema:
        items_type = create_athena_column_schema(avro_schema.items)
        return f'array<{items_type}>'

    elif type(avro_schema) == MapSchema:
        values_type = avro_schema.values.type
        return f'map<string,{values_type}>'

    elif type(avro_schema) == RecordSchema:
        field_schemas = []
        for field in avro_schema.fields:
            field_name = field.name.lower()
            field_type = create_athena_column_schema(field.type)
            field_schemas.append(f'{field_name}:{field_type}')

        field_schema_concatenated = ','.join(field_schemas)
        return f'struct<{field_schema_concatenated}>'

    elif type(avro_schema) == UnionSchema:
        # pick the first schema which is not null
        union_schemas_not_null = [s for s in avro_schema.schemas if s.type != 'null']
        if len(union_schemas_not_null) > 0:
            return create_athena_column_schema(union_schemas_not_null[0])
        else:
            raise Exception('union schemas contains only null schema')

    elif type(avro_schema) in [EnumSchema, FixedSchema]:
        return 'string'

    else:
        raise Exception(f'unknown avro schema type {avro_schema.type}')


def rename_type_names(typ: str) -> str:
    if typ in ['long']:
        return 'bigint'
    else:
        return typ
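For completeness, a small usage sketch: feeding the function the example schema from the first answer yields the column list, which can then be spliced into the CREATE TABLE statement (the table name and bucket are the same placeholders as above):
avro_schema_literal = '''
{
  "type": "record",
  "name": "example",
  "namespace": "default",
  "fields": [
    {"name": "foo", "type": ["null", "string"], "default": null},
    {"name": "bar", "type": ["null", "int"], "default": null}
  ]
}
'''

columns = create_athena_schema_from_avro(avro_schema_literal)
print(columns)  # `foo` string, `bar` int

ddl = f"""
CREATE EXTERNAL TABLE avro_table ({columns})
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal' = '{avro_schema_literal}')
STORED AS AVRO
LOCATION 's3://some-bucket/data/'
"""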

Related

dynamoDB complaining about primary key in FilterExpression

Problem
I'm getting the following error after running a Query against my DynamoDB table:
Query failed: Validation Error: Filter Expression can only contain non-primary key attributes: Primary key attribute: baked_date_and_baker_uuid
The problem is that I'm positive I'm not using any key of the selected GSI in my query's filter expression.
Context
I have two GSIs with the following schemas:
{
    "IndexName": "cake_index",
    "KeySchema": [
        {"AttributeName": "cake_uuid", "KeyType": "HASH"},
        {"AttributeName": "baked_date_and_baker_uuid", "KeyType": "RANGE"},
    ],
    "Projection": {"ProjectionType": "ALL"},
},
{
    "IndexName": "kutchen_index",
    "KeySchema": [
        {"AttributeName": "kutchen_uuid", "KeyType": "HASH"},
        {"AttributeName": "baked_date_and_baker_uuid", "KeyType": "RANGE"},
    ],
    "Projection": {"ProjectionType": "ALL"},
}
And I'm running the following query with boto3:
from datetime import datetime
from typing import Dict, Optional

from boto3.dynamodb.conditions import Attr, Key


def get_dessert_data(
    dessert_uuid: str,
    dessert_type: str,
    before_than: datetime,
    after_than: datetime,
    starting_token: Optional[Dict],
    reverse: bool = False,  # assumed parameter, used for ScanIndexForward below
):
    filters = None

    # Select appropriate GSI partition key
    if dessert_type == "cake":
        index_name = "cake_index"
        gsi_key = "cake_uuid"
    elif dessert_type == "kutchen":
        index_name = "kutchen_index"
        gsi_key = "kutchen_uuid"
    else:
        raise ValueError(f"Unknown dessert_type: {dessert_type}")  # Handle error

    # Base key condition on the GSI partition key (assumed; needed before '&=' below)
    conditions = Key(gsi_key).eq(dessert_uuid)

    # Process conditions to apply to the sort key
    if before_than is not None and after_than is not None:
        parsed_before = parse_date_into_gis_format(before_than)
        parsed_after = parse_date_into_utc_string(after_than)
        conditions &= Key("baked_date_and_baker_uuid").lte(f"{parsed_before}")
        filters = Attr("created").gte(parsed_after)  # <= Only time using filters and it's not a GSI key
    elif before_than is not None:
        parsed_datetime = parse_date_into_gis_format(before_than)
        conditions &= Key("baked_date_and_baker_uuid").lte(f"{parsed_datetime}")
    elif after_than is not None:
        parsed_datetime = parse_date_into_gis_format(after_than)
        conditions &= Key("baked_date_and_baker_uuid").gte(f"{parsed_datetime}")

    # Finally generate the token for pagination, passing the primary key for the table and the GSI.
    if starting_token is not None:
        starting_token = {
            "table_partition_key": starting_token["table_partition_key"],
            "table_sort_key": starting_token["table_sort_key"],
            gsi_key: dessert_uuid,
            "baked_date_and_baker_uuid": starting_token["baked_date_and_baker_uuid"],
        }

    pagination_conf = {
        "MaxItems": 10,
        "PageSize": 5,
        "ExclusiveStartKey": starting_token,
    }

    paginate_kwargs = dict(
        TableName="yummies",
        IndexName=index_name,
        KeyConditionExpression=conditions,
        ScanIndexForward=not reverse,
        PaginationConfig=pagination_conf,
    )

    if filters is not None:
        paginate_kwargs["FilterExpression"] = filters

    # return pagination with paginate_kwargs
More information
The error does not appear consistently; in fact, it is "new" as of today.
We managed to get the error again by running the raw DynamoDB query obtained from logs.
I'm fairly sure it's not related to the pagination configuration.
According to this,
To clarify, are you doing a query on your GSI, instead of your base
table? If so, the error message will make sense because
QueryFilter/FilterExpression doesn’t allow hash and range keys, so if
you are querying the GSI it won’t allow FilterExpression on your GSI
hash or range keys.
I've re-checked that the filter expression is not using any key fields from the selected GSI.
So what's causing the error? And why is it singling out baked_date_and_baker_uuid as an element of the FilterExpression?
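For reference, the rule from the quoted clarification comes down to this shape (a hedged sketch with the boto3 Table resource, reusing the attribute names from the question; the values are made up): when querying a GSI, its hash and range keys may only appear in the KeyConditionExpression, while the FilterExpression may only reference non-key attributes such as created.
import boto3
from boto3.dynamodb.conditions import Attr, Key

table = boto3.resource("dynamodb").Table("yummies")

# Allowed: GSI keys in the key condition, a non-key attribute in the filter.
table.query(
    IndexName="cake_index",
    KeyConditionExpression=Key("cake_uuid").eq("some-uuid")
    & Key("baked_date_and_baker_uuid").lte("2020-01-01"),
    FilterExpression=Attr("created").gte("2019-01-01"),
)

# Rejected with the validation error above: the GSI's range key in the filter.
# table.query(
#     IndexName="cake_index",
#     KeyConditionExpression=Key("cake_uuid").eq("some-uuid"),
#     FilterExpression=Attr("baked_date_and_baker_uuid").lte("2020-01-01"),
# )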

What are the extra values added to DynamoDB streams and how do I remove them?

I am using DynamoDB Streams to sync data to Elasticsearch using a Lambda function.
The format of the data (from https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.Lambda.Tutorial.html) looks like:
"NewImage": {
"Timestamp": {
"S": "2016-11-18:12:09:36"
},
"Message": {
"S": "This is a bark from the Woofer social network"
},
"Username": {
"S": "John Doe"
}
},
So, two questions:
1. What is the "S" that the stream attaches? I am assuming it indicates string or stream, but I can't find any documentation.
2. Is there an option to exclude this from the stream, or do I have to write code in my Lambda function to remove it?
What you are seeing is the DynamoDB Data Type Descriptors. This is how data is stored in DynamoDB (or at least how it is exposed via the low-level APIs). There are SDKs in various languages that will convert this to JSON.
For Python, see the TypeSerializer/TypeDeserializer classes in boto3.dynamodb.types: https://boto3.amazonaws.com/v1/documentation/api/latest/_modules/boto3/dynamodb/types.html
import decimal
import json
import boto3.dynamodb.types

deserializer = boto3.dynamodb.types.TypeDeserializer()
# 'record' is one entry from the Lambda event's 'Records' list
dic = {key: deserializer.deserialize(val) for key, val in record['dynamodb']['NewImage'].items()}

def decimal_default(obj):
    if isinstance(obj, decimal.Decimal):
        return float(obj)
    raise TypeError

json.dumps(dic, default=decimal_default)
If you want to index the result in Elasticsearch, you have to do another json.loads() to convert it back to a Python dictionary.
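As a concrete illustration, here is a small sketch applying TypeDeserializer to the NewImage sample from the question (hard-coded here instead of being read from a real Lambda event):
from boto3.dynamodb.types import TypeDeserializer

new_image = {
    "Timestamp": {"S": "2016-11-18:12:09:36"},
    "Message": {"S": "This is a bark from the Woofer social network"},
    "Username": {"S": "John Doe"},
}

deserializer = TypeDeserializer()
plain = {key: deserializer.deserialize(value) for key, value in new_image.items()}
# {'Timestamp': '2016-11-18:12:09:36',
#  'Message': 'This is a bark from the Woofer social network',
#  'Username': 'John Doe'}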
The S indicates that the value of the attribute is simply a scalar string (S) attribute type. Each DynamoDB item attribute's key name is always a string, though the attribute value doesn't have to be a scalar string; 'Naming Rules and Data Types' details each attribute data type. A string is a scalar type, which is different from a document type or a set type.
There are different views of a stream record; however, there is no stream view that omits the attribute type descriptor while still providing the attribute value. Each possible StreamViewType is explained in 'Capturing Table Activity with DynamoDB Streams'.
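For context, the view type is chosen when the stream is enabled; here is a hedged sketch with boto3 (the BarkTable name comes from the linked tutorial and is otherwise a placeholder). None of the view types changes the type-descriptor encoding; they only control which images each stream record carries:
import boto3

client = boto3.client("dynamodb")

# Possible StreamViewType values: KEYS_ONLY, NEW_IMAGE, OLD_IMAGE, NEW_AND_OLD_IMAGES.
client.update_table(
    TableName="BarkTable",  # placeholder table name
    StreamSpecification={
        "StreamEnabled": True,
        "StreamViewType": "NEW_IMAGE",
    },
)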
Have fun!

AWS: Transforming data from DynamoDB before it's sent to Cloudsearch

I'm trying to set up AWS' Cloudsearch with a DynamoDB table. My data structure is something like this:
{
    "name": "John Smith",
    "phone": "0123 456 789",
    "business": {
        "name": "Johnny's Cool Co",
        "id": "12345",
        "type": "contractor",
        "suburb": "Sydney"
    },
    "profession": {
        "name": "Plumber",
        "id": "20"
    },
    "email": "johnsmith#gmail.com",
    "id": "354684354-4b32-53e3-8949846-211384"
}
Importing this data from DynamoDB -> Cloudsearch is a breeze; however, I want to be able to index on some of these nested object parameters (like business.name, profession.name, etc).
Cloudsearch is pulling in some of the nested objects like suburb, but it seems like it's impossible for it to differentiate between the name in the root of the object and the name within the business and profession objects.
Questions:
1. How do I make these nested parameters searchable? Can I index on business.name or something?
2. If #1 is not possible, can I somehow send my data through a transforming function before it gets to Cloudsearch? This way I could flatten all of my objects and give the fields unique names like businessName and professionName.
EDIT:
My solution at the moment is to have a separate DynamoDB table which replicates our users table, but stores it in a CloudSearch-friendly format. However, I don't like this solution at all so any other ideas are totally welcome!
You can use DynamoDB Streams and write a function that runs in Lambda to capture changes and add documents to CloudSearch, flattening them at that point, instead of keeping an additional DynamoDB table.
For example, within my Lambda function I have logic that keeps the list of nested fields (within a "body" parent in this case) and I just flatten them with their field name; in the case of duplicate sub-field names you can combine the parent and field names into a new key such as "body-name".
# ... misc. setup ...
headers = {"Content-Type": "application/json"}
indexed_fields = ['app', 'name', 'activity']  # fields to flatten

def handler(event, context):  # lambda handler called at each update
    document = {}  # document to be uploaded to cloudsearch
    document['id'] = ...  # your uid, from the dynamo update record likely
    document['type'] = 'add'

    all_fields = {}

    # flatten/pull out info you want indexed
    for record in event['Records']:
        body = record['dynamodb']['NewImage']['body']['M']
        for key in indexed_fields:
            all_fields[key] = body[key]['S']

    document['fields'] = all_fields

    # post update to cloudsearch endpoint
    r = requests.post(url, auth=awsauth, json=document, headers=headers)
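To make the flattening concrete, here is an illustrative sketch (values invented, reusing the id from the question) of one simplified stream record and the document the handler above would post to CloudSearch:
# One simplified stream record, shaped like the Lambda event used above.
record = {
    "dynamodb": {
        "NewImage": {
            "body": {
                "M": {
                    "app": {"S": "woofer"},
                    "name": {"S": "John Doe"},
                    "activity": {"S": "bark"},
                }
            }
        }
    }
}

# After the loop, the posted document would look roughly like this:
document = {
    "id": "354684354-4b32-53e3-8949846-211384",
    "type": "add",
    "fields": {"app": "woofer", "name": "John Doe", "activity": "bark"},
}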

Put json data pipeline definition using Boto3

I have a data pipeline definition in json format, and I would like to 'put' that using Boto3 in Python.
I know you can do this via the AWS CLI using put-pipeline-definition, but Boto3 (and the AWS API) use a different format, splitting the definition into pipelineObjects, parameterObjects and parameterValues.
Do I need to write code to translate from a json definition to that expected by the API/Boto? If so, is there a library that does this?
The AWS CLI has code that does this translation, so I can borrow that!
You could convert from the Data Pipeline exported JSON format to the pipelineObjects format expected by boto3 using a python function of the following form.
def convert_to_pipeline_objects(pipeline_definition_dict):
    objects_list = []
    for def_object in pipeline_definition_dict['objects']:
        new_object = {
            'id': def_object['id'],
            'name': def_object['name'],
            'fields': []
        }
        for key in def_object.keys():
            if key in ('id', 'name'):
                continue
            if type(def_object[key]) == dict:
                new_object['fields'].append(
                    {
                        'key': key,
                        'refValue': def_object[key]['ref']
                    }
                )
            else:
                new_object['fields'].append(
                    {
                        'key': key,
                        'stringValue': def_object[key]
                    }
                )
        objects_list.append(new_object)
    return objects_list
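A hedged usage sketch of that helper with boto3; the pipeline id and the file name are placeholders, and parameterObjects/parameterValues are omitted for brevity:
import json

import boto3

# Hypothetical file containing the exported Data Pipeline definition.
with open("pipeline_definition.json") as f:
    definition = json.load(f)

pipeline_objects = convert_to_pipeline_objects(definition)

client = boto3.client("datapipeline")
client.put_pipeline_definition(
    pipelineId="df-0123456789ABCDEFGHIJ",  # placeholder pipeline id
    pipelineObjects=pipeline_objects,
)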

how to set table expiry time big query

I need help setting the expiry time for a new table in GBQ.
I am creating/uploading a new file as a table in GBQ using the code below:
def uploadCsvToGbq(self, table_name, jsonSchema, csvFile, delim):
    job_data = {
        'jobReference': {
            'projectId': self.project_id,
            'job_id': str(uuid.uuid4())
        },
        # "expires": str(datetime.now() + timedelta(seconds=60)),
        # "expirationTime": 20000,
        # "defaultTableExpirationMs": 20000,
        'configuration': {
            'load': {
                'writeDisposition': 'WRITE_TRUNCATE',
                'fieldDelimiter': delim,
                'skipLeadingRows': 1,
                'sourceFormat': 'CSV',
                'schema': {
                    'fields': jsonSchema
                },
                'destinationTable': {
                    'projectId': self.project_id,
                    'datasetId': self.dataset_id,
                    'tableId': table_name
                }
            }
        }
    }

    upload = MediaFileUpload(csvFile,
                             mimetype='application/octet-stream',
                             chunksize=1048576,
                             # This enables resumable uploads.
                             resumable=True)

    start = time.time()
    job_id = 'job_%d' % start

    # Create the job.
    return self.bigquery.jobs().insert(projectId=self.project_id,
                                       body=job_data,
                                       media_body=upload).execute()
This code uploads the file into GBQ as a new table without issue; now I need to set the expiry time for the table. I already tried setting expires, expirationTime and defaultTableExpirationMs (commented out above), but nothing works.
Does anyone have any idea?
You should use the Tables: patch API and set the expirationTime property.
The function below creates a table with an expirationTime, so as an alternative solution you can create the table first and insert the data later.
def createTableWithExpire(bigquery, dataset_id, table_id, expiration_time):
    """
    Creates a BQ table that will expire at the specified time.
    The expiration time can be given as a Unix timestamp, e.g. 1452627594
    """
    table_data = {
        "expirationTime": expiration_time,
        "tableReference": {
            "tableId": table_id
        }
    }

    return bigquery.tables().insert(
        projectId=_PROJECT_ID,
        datasetId=dataset_id,
        body=table_data).execute()
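A hedged usage sketch (dataset and table ids are placeholders; bigquery is the authorized service object used above). One thing worth noting: the BigQuery tables resource documents expirationTime in milliseconds since the epoch, so if a seconds-based Unix timestamp doesn't seem to take effect, try multiplying it by 1000:
import time

# Expire the table one day from now, expressed in epoch milliseconds.
expiration_ms = int((time.time() + 24 * 3600) * 1000)

createTableWithExpire(
    bigquery,                 # authorized googleapiclient service object
    dataset_id="my_dataset",  # placeholder
    table_id="my_new_table",  # placeholder
    expiration_time=expiration_ms,
)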
Also answered by Mikhail in this SO question.
Thank you both, I combined both solutions, but made some modifications to get it working for my case.
As I am creating the table by uploading a CSV, I set the expirationTime by calling the patch method and passing the table id to it:
def createTableWithExpire(bigquery, dataset_id, table_id, expiration_time):
    """
    Sets an expiration time on an existing BQ table.
    The expiration time can be given as a Unix timestamp, e.g. 1452627594
    """
    table_data = {
        "expirationTime": expiration_time,
    }

    return bigquery.tables().patch(
        projectId=_PROJECT_ID,
        datasetId=dataset_id,
        tableId=table_id,
        body=table_data).execute()
Another alternative is to set the expiration time after the table has been created:
from google.cloud import bigquery
import datetime

client = bigquery.Client()

table_ref = client.dataset('my-dataset').table('my-table')  # get table ref
table = client.get_table(table_ref)  # get Table object

# set datetime of expiration, must be a datetime type
table.expires = datetime.datetime.combine(
    datetime.date.today() + datetime.timedelta(days=2),
    datetime.time())

table = client.update_table(table, ['expires'])  # update table