AWS Glue: creating a Data Catalog table with boto3 (Python) - amazon-web-services

I have been trying to create a table in our Data Catalog using the Python API, following the documentation posted here and here. I understand the general flow, but I need to know how to declare a struct field when I create the table: looking at the StorageDescriptor documentation for the table here, there is no explanation of how to define this type of column. In addition, I don't see where the classification property for the table is covered. Maybe under the table properties? I based this sample code on the boto3 documentation:
import boto3

client = boto3.client(service_name='glue', region_name='us-east-1')

response = client.create_table(
    DatabaseName='dbname',
    TableInput={
        'Name': 'tbname',
        'Description': 'tb description',
        'Owner': "I'm",
        'StorageDescriptor': {
            'Columns': [
                {'Name': 'agents', 'Type': 'struct', 'Comment': 'from deserializer'},
                {'Name': 'conference_sid', 'Type': 'string', 'Comment': 'from deserializer'},
                {'Name': 'call_sid', 'Type': 'string', 'Comment': 'from deserializer'}
            ],
            'Location': 's3://bucket/location/',
            'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
            'Compressed': False,
            'SerdeInfo': {'SerializationLibrary': 'org.openx.data.jsonserde.JsonSerDe'}
        },
        'TableType': 'EXTERNAL_TABLE'
    }
)

I found this post because I ran into the same issue and eventually found the solution, so you can declare the column type as:
array<struct<id:string,timestamp:bigint,message:string>>
I found this "hint" while using the AWS Console and clicking on a data type of an existing table created via a Crawler. It hints:
An ARRAY of scalar type as a top - level column.
ARRAY <STRING>
An ARRAY with elements of complex type (STRUCT).
ARRAY < STRUCT <
place: STRING,
start_year: INT
>>
An ARRAY as a field (CHILDREN) within a STRUCT. (The STRUCT is inside another ARRAY, because it is rare for a STRUCT to be a top-level column.)
ARRAY < STRUCT <
spouse: STRING,
children: ARRAY <STRING>
>>
A STRUCT as the element type of an ARRAY.
ARRAY < STRUCT <
street: STRING,
city: STRING,
country: STRING
>>

Related

How do you debug google deployment manager templates?

I'm looking at this example: https://github.com/GoogleCloudPlatform/deploymentmanager-samples/tree/master/examples/v2/cloud_functions
which uses this template. I added a print statement to it, but how do I see its output?
import base64
import hashlib
from StringIO import StringIO
import zipfile


def GenerateConfig(ctx):
    """Generate YAML resource configuration."""
    in_memory_output_file = StringIO()
    function_name = ctx.env['deployment'] + 'cf'
    zip_file = zipfile.ZipFile(
        in_memory_output_file, mode='w', compression=zipfile.ZIP_DEFLATED)

    ####################################################
    ############ HOW DO I SEE THIS????? ################
    print('heelo wworrld')
    ####################################################
    ####################################################

    for imp in ctx.imports:
        if imp.startswith(ctx.properties['codeLocation']):
            zip_file.writestr(imp[len(ctx.properties['codeLocation']):],
                              ctx.imports[imp])

    zip_file.close()
    content = base64.b64encode(in_memory_output_file.getvalue())
    m = hashlib.md5()
    m.update(content)
    source_archive_url = 'gs://%s/%s' % (ctx.properties['codeBucket'],
                                         m.hexdigest() + '.zip')
    cmd = "echo '%s' | base64 -d > /function/function.zip;" % (content)
    volumes = [{'name': 'function-code', 'path': '/function'}]
    build_step = {
        'name': 'upload-function-code',
        'action': 'gcp-types/cloudbuild-v1:cloudbuild.projects.builds.create',
        'metadata': {
            'runtimePolicy': ['UPDATE_ON_CHANGE']
        },
        'properties': {
            'steps': [{
                'name': 'ubuntu',
                'args': ['bash', '-c', cmd],
                'volumes': volumes,
            }, {
                'name': 'gcr.io/cloud-builders/gsutil',
                'args': ['cp', '/function/function.zip', source_archive_url],
                'volumes': volumes
            }],
            'timeout': '120s'
        }
    }
    cloud_function = {
        'type': 'gcp-types/cloudfunctions-v1:projects.locations.functions',
        'name': function_name,
        'properties': {
            'parent': '/'.join([
                'projects', ctx.env['project'], 'locations',
                ctx.properties['location']
            ]),
            'function': function_name,
            'labels': {
                # Add the hash of the contents to trigger an update if the
                # bucket object changes
                'content-md5': m.hexdigest()
            },
            'sourceArchiveUrl': source_archive_url,
            'environmentVariables': {
                'codeHash': m.hexdigest()
            },
            'entryPoint': ctx.properties['entryPoint'],
            'httpsTrigger': {},
            'timeout': ctx.properties['timeout'],
            'availableMemoryMb': ctx.properties['availableMemoryMb'],
            'runtime': ctx.properties['runtime']
        },
        'metadata': {
            'dependsOn': ['upload-function-code']
        }
    }
    resources = [build_step, cloud_function]
    return {
        'resources': resources,
        'outputs': [{
            'name': 'sourceArchiveUrl',
            'value': source_archive_url
        }, {
            'name': 'name',
            'value': '$(ref.' + function_name + '.name)'
        }]
    }
EDIT: This is in no way a solution to the problem, but I found that setting a bunch of outputs for the info I'm interested in seeing helps a bit. So you could roll your own sort of log-ish thing by collecting info into a list in your Python template and then passing it all back as an output. Not great, but it's better than nothing.
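A rough sketch of that workaround, assuming you collect messages in a list inside GenerateConfig and expose them as an extra output (the debugLog name and the messages here are purely illustrative):

def GenerateConfig(ctx):
    # collect anything you would normally print() into a list
    debug_log = []
    debug_log.append('deployment: %s' % ctx.env['deployment'])
    debug_log.append('imports seen: %d' % len(ctx.imports))

    resources = []  # build build_step / cloud_function as in the template above

    return {
        'resources': resources,
        'outputs': [{
            # surfaces the collected messages in the deployment's outputs
            'name': 'debugLog',
            'value': '; '.join(debug_log)
        }]
    }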
Deployment Manager is an infrastructure deployment service that automates the creation and management of Google Cloud Platform (GCP) resources. What you are trying to do is not possible in Deployment Manager because of its managed environment.
As of now, the only way to troubleshoot is to rely on the expanded template in the Deployment Manager dashboard. There is already a feature request to address your use case here. I advise you to star the feature request to receive updates via email, and to leave a comment to show the community's interest. All official communication regarding that feature will be posted there.

How to use AVRO format on AWS Glue/Athena

I have a few topics in Kafka that are writing AVRO files into S3 buckets, and I would like to run some queries on those buckets using AWS Athena.
I'm trying to create a table, but the AWS Glue crawler runs and doesn't add my table (it works if I change the file type to JSON). I've also tried to create a table from the Athena console, but it shows no support for the AVRO file format.
Any idea how to make it work?
I suggest doing it manually and not via Glue. Glue only works for the most basic situations, and this falls outside that, unfortunately.
You can find the documentation on how to create an Avro table here: https://docs.aws.amazon.com/athena/latest/ug/avro.html
The caveat for Avro tables is that you need to specify both the table columns and the Avro schema. This may look weird and redundant, but it's how Athena/Presto works. It needs a schema to know how to interpret the files, and then it needs to know which of the properties in the files you want to expose as columns (and their types, which may or may not match the Avro types).
CREATE EXTERNAL TABLE avro_table (
  foo STRING,
  bar INT
)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal' = '
{
  "type": "record",
  "name": "example",
  "namespace": "default",
  "fields": [
    {
      "name": "foo",
      "type": ["null", "string"],
      "default": null
    },
    {
      "name": "bar",
      "type": ["null", "int"],
      "default": null
    }
  ]
}
')
STORED AS AVRO
LOCATION 's3://some-bucket/data/';
Notice how the Avro schema appears as a JSON document inside of a serde property value (single quoted) – the formatting is optional, but makes this example easier to read.
Doing it manually seems to be the way to make it work.
Here is some code that generates the Athena schema directly from an Avro schema literal. It works with avro-python3 on Python 3.7. It is taken from here: https://github.com/dataqube-GmbH/avro2athena (I am the owner of the repo).
from avro.schema import Parse, RecordSchema, PrimitiveSchema, ArraySchema, MapSchema, EnumSchema, UnionSchema, FixedSchema


def create_athena_schema_from_avro(avro_schema_literal: str) -> str:
    avro_schema: RecordSchema = Parse(avro_schema_literal)

    column_schemas = []
    for field in avro_schema.fields:
        column_name = field.name.lower()
        column_type = create_athena_column_schema(field.type)
        column_schemas.append(f"`{column_name}` {column_type}")

    return ', '.join(column_schemas)


def create_athena_column_schema(avro_schema) -> str:
    if type(avro_schema) == PrimitiveSchema:
        return rename_type_names(avro_schema.type)

    elif type(avro_schema) == ArraySchema:
        items_type = create_athena_column_schema(avro_schema.items)
        return f'array<{items_type}>'

    elif type(avro_schema) == MapSchema:
        values_type = avro_schema.values.type
        return f'map<string,{values_type}>'

    elif type(avro_schema) == RecordSchema:
        field_schemas = []
        for field in avro_schema.fields:
            field_name = field.name.lower()
            field_type = create_athena_column_schema(field.type)
            field_schemas.append(f'{field_name}:{field_type}')

        field_schema_concatenated = ','.join(field_schemas)
        return f'struct<{field_schema_concatenated}>'

    elif type(avro_schema) == UnionSchema:
        # pick the first schema which is not null
        union_schemas_not_null = [s for s in avro_schema.schemas if s.type != 'null']
        if len(union_schemas_not_null) > 0:
            return create_athena_column_schema(union_schemas_not_null[0])
        else:
            raise Exception('union schemas contains only null schema')

    elif type(avro_schema) in [EnumSchema, FixedSchema]:
        return 'string'

    else:
        raise Exception(f'unknown avro schema type {avro_schema.type}')


def rename_type_names(typ: str) -> str:
    if typ in ['long']:
        return 'bigint'
    else:
        return typ
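For reference, a quick usage sketch with the same example schema as the Athena DDL above; the expected output is shown as a comment:

avro_schema_literal = '''
{
  "type": "record",
  "name": "example",
  "namespace": "default",
  "fields": [
    {"name": "foo", "type": ["null", "string"], "default": null},
    {"name": "bar", "type": ["null", "int"], "default": null}
  ]
}
'''

# prints something along the lines of: `foo` string, `bar` int
print(create_athena_schema_from_avro(avro_schema_literal))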

Boto3 athena query without saving data to s3

I am trying to use boto3 to run a set of queries, and I don't want to save the data to S3. Instead, I just want to get the results and work with them. I am trying the following:
import boto3

client = boto3.client('athena')

response = client.start_query_execution(
    QueryString='''SELECT * FROM mytable limit 10''',
    QueryExecutionContext={
        'Database': 'my_db'
    },
    ResultConfiguration={
        'OutputLocation': 's3://outputpath',
    }
)
print(response)
But I don't want to pass ResultConfiguration, because I don't want to write the results anywhere. If I remove the ResultConfiguration parameter, I get the following error:
botocore.exceptions.ParamValidationError: Parameter validation failed:
Missing required parameter in input: "ResultConfiguration"
So it seems that giving an S3 output location is mandatory. Is there a way to avoid this and get the results only in the response?
The StartQueryExecution action does indeed require an S3 output location; the ResultConfiguration parameter is mandatory.
The alternative way to query Athena is through the JDBC or ODBC drivers. You should probably use that method if you don't want to store results in S3.
You will have to specify an S3 temp bucket location whenever running the 'start_query_execution' command. However, you can get a result set (a dict) by running the 'get_query_results' method using the query id.
The response (dict) will look like this:
{
    'UpdateCount': 123,
    'ResultSet': {
        'Rows': [
            {
                'Data': [
                    {
                        'VarCharValue': 'string'
                    },
                ]
            },
        ],
        'ResultSetMetadata': {
            'ColumnInfo': [
                {
                    'CatalogName': 'string',
                    'SchemaName': 'string',
                    'TableName': 'string',
                    'Name': 'string',
                    'Label': 'string',
                    'Type': 'string',
                    'Precision': 123,
                    'Scale': 123,
                    'Nullable': 'NOT_NULL'|'NULLABLE'|'UNKNOWN',
                    'CaseSensitive': True|False
                },
            ]
        }
    },
    'NextToken': 'string'
}
For more information, see boto3 client doc: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/athena.html#Athena.Client.get_query_results
You can then delete all files in the S3 temp bucket you've specified.
You still need to provide an S3 location as temporary storage for Athena to save the data, even though you want to process it in Python. But you can page through the data as tuples using the Pagination API; please refer to the example here. Hope that helps.
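To illustrate those last two answers, here is a minimal sketch that reads the results of the query from the question back into Python, paging through them with boto3's paginator for get_query_results. It assumes the query started above has already finished successfully (in practice you would poll get_query_execution first):

import boto3

client = boto3.client('athena')

# id of the query started with start_query_execution above
query_execution_id = response['QueryExecutionId']

paginator = client.get_paginator('get_query_results')
for page in paginator.paginate(QueryExecutionId=query_execution_id):
    for row in page['ResultSet']['Rows']:
        # each cell is a dict with an optional 'VarCharValue'
        print([col.get('VarCharValue') for col in row['Data']])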

Map different Sort Key responses to Appsync Schema values

So here is my schema:
type Model {
    PartitionKey: ID!
    Name: String
    Version: Int
    FBX: String
    # ms since epoch
    CreatedAt: AWSTimestamp
    Description: String
    Tags: [String]
}

type Query {
    getAllModels(count: Int, nextToken: String): PaginatedModels!
}

type PaginatedModels {
    models: [Model!]!
    nextToken: String
}
I would like to call 'getAllModels' and have all of its data, and all of its tags, filled in.
But here is the thing: tags are stored via sort keys, like so:
PartitionKey | SortKey
Model-0 | Model-0
Model-0 | Tag-Tree
Model-0 | Tag-Building
Is it possible to transform the 'Tag' sort keys into the Tags: [String] array in the schema via a DynamoDB resolver? Or must I do something extra fancy through a lambda? Or is there a smarter way to do this?
To clarify: are you storing objects like this in DynamoDB:
{ PartitionKey (HASH), Tag (SortKey), Name, Version, FBX, CreatedAt, Description }
using a DynamoDB Query operation to fetch all rows for a given hash key,
Query #PartitionKey = :PartitionKey
and getting back a list of objects, some of which have a different "Tag" value and one of which is "Model-0" (i.e. the same value as the partition key), which I assume contains all the other values for the record? E.g.:
[
    { PartitionKey, Tag: 'ValueOfPartitionKey', Name, Version, FBX, CreatedAt, ... },
    { PartitionKey, Tag: 'Tag-Tree' },
    { PartitionKey, Tag: 'Tag-Building' }
]
You can definitely write resolver logic without too much hassle that reduces the list of model objects into a single object with a list of "Tags". Let's start with a single item and see how to implement a getModel(id: ID!): Model query:
First, define the request mapping template that will get all rows for a partition key:
{
    "version" : "2017-02-28",
    "operation" : "Query",
    "query" : {
        "expression": "#PartitionKey = :id",
        "expressionValues" : {
            ":id" : {
                "S" : "${ctx.args.id}"
            }
        },
        "expressionNames": {
            "#PartitionKey": "PartitionKey" ## or whatever the table's hash key is
        }
    },
    ## The limit will have to be sufficiently large to get all rows for a key
    "limit": $util.defaultIfNull(${ctx.args.limit}, 100)
}
Then to return a single model object that reduces "Tag" to "Tags" you can use this response mapping template:
#set($tags = [])
#set($result = {})
#foreach( $item in $ctx.result.items )
    #if($item.PartitionKey == $item.Tag)
        #set($result = $item)
    #else
        $util.qr($tags.add($item.Tag))
    #end
#end
$util.qr($result.put("Tags", $tags))
$util.toJson($result)
This will return a response like this:
{
    "PartitionKey": "...",
    "Name": "...",
    "Tags": ["Tag-Tree", "Tag-Building"]
}
Fundamentally I see no problem with this, but its effectiveness depends on your query patterns. Extending this to the getAllModels use case is doable, but it will require a few changes and most likely a fairly inefficient Scan operation, because the table will be sparse in actual model information since many records are effectively just tags. You can alleviate this with GSIs pretty easily, but more GSIs means more $.
As an alternative approach, you can store your tags in a separate "Tags" table. This way you only store model information in the Model table and tag information in the Tags table, and you leverage GraphQL to perform the join for you. In this approach, have Query.getAllModels perform a Scan (or Query) on the Model table, and then have a Model.Tags resolver that performs a Query against the Tags table (HK: ModelPartitionKey, SK: Tag). You could then get all tags for a model, and later create a GSI to get all models for a tag. You do need to consider that the nested Model.Tags query will now be called once per model, but Query operations are fast, and I've seen this work well in practice.
Hope this helps :)

Google BigQuery support for Avro logicalTypes

Google claims there is no support for converting an Avro logicalType to the corresponding BigQuery-specific type (as described here, at the bottom).
However, I am able to load an Avro file with the following schema:
schema = {
    'name': 'test',
    'namespace': 'testing',
    'type': 'record',
    'fields': [
        {'name': 'test_timestamp', 'type': 'long', 'logicalType': 'timestamp-millis'},
    ],
}
into BigQuery with a column of type TIMESTAMP.
The situation is different with the following schema:
schema = {
    'name': 'test',
    'namespace': 'testing',
    'type': 'record',
    'fields': [
        {'name': 'testdate', 'type': 'int', 'logicalType': 'date'},
    ],
}
and a BigQuery table with a column of type DATE. I was using bq load in the following way (in both cases):
bq --location=EU load --source_format=AVRO --project_id=test-project dataset.table "gs://bucket/test_file.avro"
and it failed with the exception:
Field testdate has changed type from DATE to INTEGER
Is there any chance that logicalTypes will be supported by BigQuery, or is there an elegant way to work around this situation? (I'm aware of the workaround where a temporary table is used and a BigQuery SQL SELECT then casts TIMESTAMPs to DATEs, but it's not really pretty :P)
Native understanding of Avro logical types is now publicly available to all BigQuery users. Please refer to the documentation page here for more details: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#logical_types
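As a hedged sketch of what that looks like from Python (assuming a reasonably recent google-cloud-bigquery client; the bucket, dataset and table names are the ones from the question), the relevant knob is use_avro_logical_types:

from google.cloud import bigquery

client = bigquery.Client(project='test-project')

# ask BigQuery to honour Avro logical types, so 'date' / 'timestamp-millis'
# land as DATE / TIMESTAMP instead of INTEGER / LONG
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    use_avro_logical_types=True,
)

load_job = client.load_table_from_uri(
    'gs://bucket/test_file.avro',
    'dataset.table',
    job_config=job_config,
)
load_job.result()  # wait for the load to complete

As far as I know, the bq load command has an equivalent --use_avro_logical_types flag; check the documentation linked above for the exact invocation.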