I have an Athena table defined in CloudFormation with a template specified like so:
CloudFormation Create
EventsTable:
Type: AWS::Glue::Table
Properties:
CatalogId: !Ref AWS::AccountId
DatabaseName: !Ref DatabaseName
TableInput:
Description: "My Table"
Name: !Ref TableName
TableType: EXTERNAL_TABLE
StorageDescriptor:
Compressed: True
Columns:
- Name: account_id
Type: string
Comment: "Account Id of the account making the request"
...
InputFormat: org.apache.hadoop.mapred.TextInputFormat
SerdeInfo:
SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
OutputFormat: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
Location: !Sub "s3://${EventsBucketName}/events/"
This deploys and works well. I also found out that I can have partition projection set up, as per this doc and this doc,
and I can make that work with a direct table creation in SQL, roughly:
SQL Create
CREATE EXTERNAL TABLE `performance_data.events`
(
`account_id` string,
...
)
PARTITIONED BY (
`day` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
's3://my-bucket/events/'
TBLPROPERTIES (
'has_encrypted_data' = 'false',
'projection.enabled' = 'true',
'projection.day.type' = 'date',
'projection.day.format' = 'yyyy/MM/dd',
'projection.day.range' = '2020/01/01,NOW',
'projection.day.interval' = '1',
'projection.day.interval.unit' = 'DAYS',
'storage.location.template' = 's3://my-bucket/events/${day}/'
)
But I can't find the docs for converting this into the CloudFormation structure. So my question is: how can I achieve the partition projection shown in the SQL in CloudFormation?
I now have a working solution. The missing piece really was a single missing parameter; here is the solution:
MyTableResource:
Type: AWS::Glue::Table
Properties:
CatalogId: MyAccountId
DatabaseName: MyDatabase
TableInput:
Description: "My Table"
Name: mytable
TableType: EXTERNAL_TABLE
PartitionKeys:
- Name: day
Type: string
Comment: Day partition
Parameters:
"projection.enabled": "true"
"projection.day.type": "date"
"projection.day.format": "yyyy/MM/dd"
"projection.day.range": "2020/01/01,NOW"
"projection.day.interval": "1"
"projection.day.interval.unit": "DAYS"
"storage.location.template": "s3://my-bucket/events/${day}/"
StorageDescriptor:
Compressed: True
Columns:
...
InputFormat: org.apache.hadoop.mapred.TextInputFormat
SerdeInfo:
Parameters:
serialization.format: '1'
SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
OutputFormat: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
Location: "s3://my-bucket/events/"
The key addition was:
serialization.format: '1'
This now works completely, and one can run a query that uses the partition, such as:
select * from mytable where day > '2022/05/03'
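One caveat if the bucket name comes from a template parameter, as in the original question's template: Fn::Sub would try to resolve ${day} as a variable, so the projection placeholder needs the ${!day} escape to come out as a literal ${day}. A minimal sketch of just that property, assuming the EventsBucketName parameter from the question:
Parameters:
  "projection.enabled": "true"
  # ...the other projection.day.* keys as in the solution above...
  # ${!day} is emitted as a literal ${day} by Fn::Sub
  "storage.location.template": !Sub "s3://${EventsBucketName}/events/${!day}/"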
Referring to the CloudFormation reference for the Glue Table TableInput, you can specify PartitionKeys and Parameters. These are the equivalents of PARTITIONED BY and TBLPROPERTIES in the query.
EDIT
As an example, you can refer to this article. The sample below shows how to define the PartitionKeys and how to define the Parameters as JSON. In your case, you just have to add the projection keys (such as projection.enabled) and their values (such as true); see the sketch after the sample.
# Create an Amazon Glue table
CFNTableFlights:
# Creating the table waits for the database to be created
DependsOn: CFNDatabaseFlights
Type: AWS::Glue::Table
Properties:
CatalogId: !Ref AWS::AccountId
DatabaseName: !Ref CFNDatabaseName
TableInput:
Name: !Ref CFNTableName1
Description: Define the first few columns of the flights table
TableType: EXTERNAL_TABLE
Parameters: {
"classification": "csv"
}
# ViewExpandedText: String
PartitionKeys:
# Data is partitioned by month
- Name: mon
Type: bigint
StorageDescriptor:
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Columns:
- Name: year
Type: bigint
- Name: quarter
Type: bigint
- Name: month
Type: bigint
- Name: day_of_month
Type: bigint
InputFormat: org.apache.hadoop.mapred.TextInputFormat
Location: s3://crawler-public-us-east-1/flight/2016/csv/
SerdeInfo:
Parameters:
field.delim: ","
SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
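For instance, wiring partition projection into the flights sample above would mean keeping mon under PartitionKeys and adding the projection keys next to the classification in Parameters. A rough sketch with an illustrative integer projection range; with a Hive-style mon=... layout the storage.location.template property can usually be omitted:
      Parameters: {
        "classification": "csv",
        "projection.enabled": "true",
        "projection.mon.type": "integer",
        "projection.mon.range": "1,12"
      }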
I want to create and deploy a template that itself deploys a product from the AWS service catalog. Here is my template:
Parameters:
ProductId:
Type: String
ProvisioningArtifactName:
Type: String
Description:
Type: String
Region:
Type: CommaDelimitedList
VpcSize:
Type: String
BastionHostKeyName:
Type: String
ProvisioningArtifactName:
Type: String
Resources:
VPCAndMore:
Type: AWS::ServiceCatalog::CloudFormationProvisionedProduct
Properties:
ProductId: ProductId
ProvisioningArtifactName: ProvisioningArtifactName
ProvisioningParameters:
- Key: Description
Value: Description
- Key: AvailabilityZones
Value: Region
- Key: VpcSize
Value: VpcSize
- Key: BastionHostKeyName
Value: BastionHostKeyName
When I try to deploy it manually I enter all parameter values. They are definitely correct and of the correct type. But once I deploy it I get an error like this:
Product ProductId not found. (Service: ServiceCatalog, Status Code: 400, Request ID: 35f27a2a-1317-48d0-815e-16ebe949d039, Extended Request ID: null)
For some reason the ProductId parameter does not seem to be resolved.
What am I missing? Or does CloudFormation not support parameter resolution outside of ProvisioningParameters?
You need to reference the parameter values with the intrinsic function !Ref, like below:
Parameters:
ProductId:
Type: String
ProvisioningArtifactName:
Type: String
Description:
Type: String
Region:
Type: CommaDelimitedList
VpcSize:
Type: String
BastionHostKeyName:
Type: String
Resources:
VPCAndMore:
Type: AWS::ServiceCatalog::CloudFormationProvisionedProduct
Properties:
ProductId: !Ref ProductId
ProvisioningArtifactName: !Ref ProvisioningArtifactName
ProvisioningParameters:
- Key: Description
Value: !Ref Description
- Key: AvailabilityZones
Value: !Ref Region
- Key: VpcSize
Value: !Ref VpcSize
- Key: BastionHostKeyName
Value: !Ref BastionHostKeyName
The problem is that you're only inserting the parameter names as plain strings without referencing them.
You need to use the intrinsic function !Ref, exactly as in the template above.
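In other words, without !Ref the property receives the literal string "ProductId" instead of the parameter's value, which is why Service Catalog reports that the product cannot be found. Side by side (illustrative):
# Passes the literal text "ProductId" to Service Catalog - fails
ProductId: ProductId

# Resolves to the value supplied for the ProductId parameter - works
ProductId: !Ref ProductId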
I have a lot of AWS::Glue::Table resources in my AWS templates, and I do not want to copy-paste the same snippet of code from template to template. So the idea is to create a reusable nested stack that accepts the params. I did that, but one problem remains: I do not know how I can pass the columns to this stack via params, e.g. [{Type: string, Name: type}, {Type: string, Name: timeLogged}] - it is an array of objects, but params only accept the string type.
I tried to do something like this:
!Split [ "," , "{Type: string, Name: type}, {Type: string, Name: timeLogged}"] - but it did not help
AWSTemplateFormatVersion: 2010-09-09
Description: The AWS CloudFormation template for creating a Glue table
Parameters:
DestinationBucketName:
Type: String
Description: Destination Regional Bucket Name
DestinationBucketPrefix:
Type: String
Description: Destination Regional Bucket Prefix
DatabaseName:
Type: String
Description: Database for Kinesis Analytics
TableName:
Type: String
Description: Table for Kinesis Analytics
InputFormat:
Type: String
Description: Input format for data
OutputFormat:
Type: String
Description: Output format for data
SerializationLibrary:
Type: String
Description: Serialization library for converting data
Resources:
LogsCollectionTable:
Type: AWS::Glue::Table
Properties:
DatabaseName: !Ref DatabaseName
CatalogId: !Ref AWS::AccountId
TableInput:
Name: !Ref TableName
Description: Table for storing data
TableType: EXTERNAL_TABLE
StorageDescriptor:
Columns: [{Type: string, Name: type}, {Type: string, Name: timeLogged}]
Location: !Sub s3://${DestinationBucketName}/${DestinationBucketPrefix}
InputFormat: !Ref InputFormat
OutputFormat: !Ref OutputFormat
SerdeInfo:
SerializationLibrary: !Ref SerializationLibrary
Short answer: you currently cannot. You would need to pass every parameter manually.
Source
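For completeness, "passing every parameter manually" would mean a fixed number of columns, with one string parameter per column attribute in the nested stack. A sketch with hypothetical parameter names, building on the template above:
Parameters:
  Column1Name:
    Type: String
  Column1Type:
    Type: String
  Column2Name:
    Type: String
  Column2Type:
    Type: String
# ...and inside the StorageDescriptor of LogsCollectionTable:
          Columns:
            - Name: !Ref Column1Name
              Type: !Ref Column1Type
            - Name: !Ref Column2Name
              Type: !Ref Column2Type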
I have been searching for an example of how to set up CloudFormation for a Glue workflow that includes triggers, jobs, and crawlers, but I haven't been able to find much information on it.
This is the only piece of information I was able to find from AWS:
{
"Type" : "AWS::Glue::Workflow",
"Properties" : {
"DefaultRunProperties" : Json,
"Description" : String,
"Name" : String,
"Tags" : Json
}
}
Here's an example of a workflow with one crawler and a job to be run after the crawler finishes.
The workflow is tied together by tagging the triggers with the WorkflowName.
I believe there can be only one SCHEDULED or ON_DEMAND trigger to start the workflow. All the other triggers in the workflow need to be CONDITIONAL on the jobs / crawlers. That is probably how Glue knows how to build the DAG.
Also see how the workflow parameters are defined as JSON in the DefaultRunProperties.
---
AWSTemplateFormatVersion: '2010-09-09'
Parameters:
BaseBucket:
Description: Bucket used by my workflow jobs
Type: String
Resources:
MyWorkflow:
Type: AWS::Glue::Workflow
Properties:
DefaultRunProperties:
{
"workflowParameter1": "Foo",
"workflowParameter2": "Bar",
"bucket": { "Fn::Sub": "${BaseBucket}" }
}
Description: Workflow for orchestrating my jobs
Name: MyWorkflowName
WorkflowCrawler:
Type: AWS::Glue::Crawler
Properties:
Name: MyCrawler
Role: MyCrawlerRole
Description: A crawler to run as the first step in the workflow
DatabaseName: MyDatabase
Targets:
S3Targets:
- Path: !Sub "s3://${BaseBucket}/"
WorkflowJob:
Type: AWS::Glue::Job
Properties:
Description: Glue job to run after the crawler
Name: MyWorkflowJob
Role: MyJobRole
Command:
Name: pythonshell
PythonVersion: 3
ScriptLocation: !Sub "s3://${BaseBucket}/my_workflow_job_script.py"
WorkflowStartTrigger:
Type: AWS::Glue::Trigger
Properties:
Name: StartTrigger
Type: ON_DEMAND
Description: Trigger for starting the workflow
Actions:
- CrawlerName: !Ref WorkflowCrawler
WorkflowName: !Ref MyWorkflow
WorkflowJobTrigger:
Type: AWS::Glue::Trigger
Properties:
Name: CrawlerSuccessfulTrigger
Type: CONDITIONAL
StartOnCreation: True
Description: Trigger to start the glue job
Actions:
- JobName: !Ref WorkflowJob
Predicate:
Conditions:
- LogicalOperator: EQUALS
CrawlerName: !Ref WorkflowCrawler
CrawlState: SUCCEEDED
WorkflowName: !Ref MyWorkflow
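If the workflow should start on a schedule rather than on demand, the start trigger above could be swapped for a SCHEDULED one; a sketch, with an arbitrary cron expression:
  WorkflowStartTrigger:
    Type: AWS::Glue::Trigger
    Properties:
      Name: StartTrigger
      Type: SCHEDULED
      Schedule: cron(0 6 * * ? *)
      StartOnCreation: True
      Description: Scheduled trigger starting the workflow daily at 06:00 UTC
      Actions:
        - CrawlerName: !Ref WorkflowCrawler
      WorkflowName: !Ref MyWorkflow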
Here is an example of a Glue workflow using triggers, crawlers and a job to convert JSON to Parquet:
JSONtoParquetWorkflow:
Type: AWS::Glue::Workflow
Properties:
Name: json-to-parquet-workflow
Description: Workflow for orchestrating JSON to Parquet conversion
RawJSONCrawlerTrigger:
Type: AWS::Glue::Trigger
Properties:
WorkflowName: !Ref JSONtoParquetWorkflow
Name: raw-json-crawler-trigger
Description: Start crawler for raw JSON data
Type: ON_DEMAND
Actions:
- CrawlerName: !Ref RawJSONCrawler
JSONToParquetETLJobTrigger:
Type: AWS::Glue::Trigger
Properties:
WorkflowName: !Ref JSONtoParquetWorkflow
Name: json-to-parquet-etl-trigger
Description: Start JSON to Parquet ETL job
Type: CONDITIONAL
StartOnCreation: True
Predicate:
Conditions:
- LogicalOperator: EQUALS
CrawlerName: !Ref RawJSONCrawler
CrawlState: SUCCEEDED
Actions:
- JobName: !Ref JSONToParquetETLJob
RawParquetCrawlerTrigger:
Type: AWS::Glue::Trigger
Properties:
WorkflowName: !Ref JSONtoParquetWorkflow
Name: raw-parquet-crawler-trigger
Description: Start crawler for raw Parquet data
Type: CONDITIONAL
StartOnCreation: True
Predicate:
Conditions:
- LogicalOperator: EQUALS
JobName: !Ref JSONToParquetETLJob
State: SUCCEEDED
Actions:
- CrawlerName: !Ref RawParquetCrawler
There is no simple example to be found, so I created one: AWS Glue Workflow: Getting started, which uses an AWS CloudFormation template. The example is very simple and is explained with diagrams.
I have some data in S3, created a schema in the Glue catalog, and then exposed it to QuickSight via Athena. All this works great when I create it by clicking in the console.
I then converted it to the following CloudFormation:
AnalyticsDatabase:
Type: AWS::Glue::Database
Properties:
DatabaseInput:
Name: analytics
CatalogId: !Ref AWS::AccountId
RawAnalysisAnalyticsTable:
Type: AWS::Glue::Table
Properties:
DatabaseName: !Ref AnalyticsDatabase
CatalogId: !Ref AWS::AccountId
TableInput:
Name: analysis_raw
TableType: EXTERNAL_TABLE
Parameters:
classification: json
StorageDescriptor:
Columns:
- {Name: id, Type: string}
- {Name: treeid, Type: string}
- {Name: patientid, Type: string}
Compressed: false
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Location: s3://my-bucket/dynamodb/Analysis/
NumberOfBuckets: 0
SerdeInfo:
Parameters: {paths: 'id,patientid,treeid'}
SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
SortColumns: []
StoredAsSubDirectories: false
However, when I try to pull the CF-created table into QuickSight I get:
Your database generated a SQL exception. This can be caused by query timeouts, resource constraints, unexpected DDL alterations before or during a query, and other database errors. Check your database settings and your query, and try again.
region: us-east-1
timestamp: 1544113019756
requestId: 5ab8f9a2-f972-11e8-b201-154c30728c75
sourceErrorCode: 0
sourceErrorMessage: [Simba][JDBC](11380) Null pointer exception.
sourceErrorState: HY000
sourceException: java.sql.SQLException
sourceType: ATHENA
Does anyone have any idea what this error means or how I can troubleshoot it? I've compared all the properties of the manually-created table to the CloudFormation-created table, and they seem identical.
Max's answer should be the accepted answer here. I replicated this, and the only solution that worked was to add the PartitionKeys: [] parameter. I had initially added it as a child of StorageDescriptor, which didn't work; it has to be added at the TableInput level, as specified in the docs. It is the right answer because none of the other causes listed here (security, etc.) produce the NullPointerException referenced in the question.
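Applied to the table from the question, that means an empty PartitionKeys list sitting directly under TableInput, roughly:
  TableInput:
    Name: analysis_raw
    TableType: EXTERNAL_TABLE
    PartitionKeys: []   # at the TableInput level, not inside StorageDescriptor
    Parameters:
      classification: json
    StorageDescriptor:
      # ...unchanged from the question...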
From the documentation for AWS::Athena::NamedQuery, it is unclear how to attach Athena to an S3 bucket specified in the same stack.
If I had to guess from the example, I would imagine that you can write a template like this:
Resources:
MyS3Bucket:
Type: AWS::S3::Bucket
... other params ...
AthenaNamedQuery:
Type: AWS::Athena::NamedQuery
Properties:
Database: "db_name"
Name: "MostExpensiveWorkflow"
QueryString: >
CREATE EXTERNAL TABLE db_name.test_table
(...) LOCATION s3://.../path/to/folder/
Would a template like the above work? Upon stack creation, will the table db_name.test_table be available to run queries on?
Turns out the way you connect S3 and Athena is to make a Glue table! How silly of me!! Of course Glue is how you connect things!
Sarcasm aside, this is a template that worked for me, using AWS::Glue::Table and AWS::Glue::Database:
Resources:
MyS3Bucket:
Type: AWS::S3::Bucket
MyGlueDatabase:
Type: AWS::Glue::Database
Properties:
DatabaseInput:
Name: my-glue-database
Description: "Glue beats tape"
CatalogId: !Ref AWS::AccountId
MyGlueTable:
Type: AWS::Glue::Table
Properties:
DatabaseName: !Ref MyGlueDatabase
CatalogId: !Ref AWS::AccountId
TableInput:
Name: my-glue-table
Parameters: { "classification" : "csv" }
StorageDescriptor:
Location:
Fn::Sub: "s3://${MyS3Bucket}/"
InputFormat: "org.apache.hadoop.mapred.TextInputFormat"
OutputFormat: "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
SerdeInfo:
Parameters: { "separatorChar" : "," }
SerializationLibrary: "org.apache.hadoop.hive.serde2.OpenCSVSerde"
StoredAsSubDirectories: false
Columns:
- Name: column0
Type: string
- Name: column1
Type: string
After this, the database and table were in the AWS Athena Console!
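And to tie this back to the original AWS::Athena::NamedQuery question, a named query against that table can live in the same stack; a sketch (the query text is purely illustrative):
  MyNamedQuery:
    Type: AWS::Athena::NamedQuery
    Properties:
      Database: !Ref MyGlueDatabase   # Ref on a Glue database returns its name
      Name: SelectFromMyGlueTable
      Description: Example query against the Glue-backed table
      QueryString: !Sub "SELECT * FROM ${MyGlueTable} LIMIT 10"   # Ref/Sub on a Glue table returns its name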