I have some data in S3, created a schema in the Glue catalog, and then exposed it to QuickSight via Athena. All this works great when I create it by clicking in the console.
I then converted it to the following CloudFormation:
AnalyticsDatabase:
  Type: AWS::Glue::Database
  Properties:
    DatabaseInput:
      Name: analytics
    CatalogId: !Ref AWS::AccountId
RawAnalysisAnalyticsTable:
  Type: AWS::Glue::Table
  Properties:
    DatabaseName: !Ref AnalyticsDatabase
    CatalogId: !Ref AWS::AccountId
    TableInput:
      Name: analysis_raw
      TableType: EXTERNAL_TABLE
      Parameters:
        classification: json
      StorageDescriptor:
        Columns:
          - {Name: id, Type: string}
          - {Name: treeid, Type: string}
          - {Name: patientid, Type: string}
        Compressed: false
        InputFormat: org.apache.hadoop.mapred.TextInputFormat
        OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
        Location: s3://my-bucket/dynamodb/Analysis/
        NumberOfBuckets: 0
        SerdeInfo:
          Parameters: {paths: 'id,patientid,treeid'}
          SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
        SortColumns: []
        StoredAsSubDirectories: false
However, when I try to pull the CF-created table into QuickSight I get:
Your database generated a SQL exception. This can be caused by query timeouts, resource constraints, unexpected DDL alterations before or during a query, and other database errors. Check your database settings and your query, and try again.
region: us-east-1
timestamp: 1544113019756
requestId: 5ab8f9a2-f972-11e8-b201-154c30728c75
sourceErrorCode: 0
sourceErrorMessage: [Simba][JDBC](11380) Null pointer exception.
sourceErrorState: HY000
sourceException: java.sql.SQLException
sourceType: ATHENA
Does anyone have any idea what this error means or how I can troubleshoot it? I've compared all the properties of the manually-created table to the CloudFormation-created table, and they seem identical.
Max's answer should be the accepted answer here. I replicated this, and the only solution that worked was to add the PartitionKeys: [] parameter. I had initially added it as a child of StorageDescriptor, which didn't work; it has to be added as a direct child of TableInput, as specified in the docs. It is the right answer because none of the other conditions listed here (security, etc.) will give the NullPointerException referenced in the question.
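For anyone who wants to see where the key goes, here is a minimal sketch of the table from the question with the empty PartitionKeys list placed directly under TableInput. Everything except the PartitionKeys line is copied from the question; I have left out a few optional StorageDescriptor properties for brevity, so treat it as an illustration rather than a drop-in replacement.

```yaml
RawAnalysisAnalyticsTable:
  Type: AWS::Glue::Table
  Properties:
    DatabaseName: !Ref AnalyticsDatabase
    CatalogId: !Ref AWS::AccountId
    TableInput:
      Name: analysis_raw
      TableType: EXTERNAL_TABLE
      # Explicit empty list, as a direct child of TableInput
      # (not nested under StorageDescriptor)
      PartitionKeys: []
      Parameters:
        classification: json
      StorageDescriptor:
        Columns:
          - {Name: id, Type: string}
          - {Name: treeid, Type: string}
          - {Name: patientid, Type: string}
        InputFormat: org.apache.hadoop.mapred.TextInputFormat
        OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
        Location: s3://my-bucket/dynamodb/Analysis/
        SerdeInfo:
          Parameters: {paths: 'id,patientid,treeid'}
          SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
```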
Related
I am trying to create a data quality validation for a set of files in S3. For that I have chosen AWS Glue DataBrew and have created a dataset, data quality rules, and a data profile job via a SAM template.
Once a dataset is created, I have to refer to the ARN of the dataset when creating the ruleset, and to the ARN of the ruleset when creating the profile job.
On checking the documentation, I can see that the ARN is not among the return values for either the dataset or the data quality ruleset. So is it possible to reference these values dynamically, or should I create the rulesets separately?
SampleDataSet:
  Type: AWS::DataBrew::Dataset
  Properties:
    Name: SampleDataSet
    Input:
      S3InputDefinition:
        Bucket: *****
        Key: *****
SampleRuleSet:
  Type: AWS::DataBrew::Ruleset
  Properties:
    Name: SampleRuleSet
    Rules:
      - Name: rule1
        Disabled: true
        CheckExpression: "AGG(DUPLICATE_ROWS_COUNT) <= :val1"
        SubstitutionMap:
          - Value: "0"
            ValueReference: ":val1"
    TargetArn: !GetAtt SampleDataSet.Arn
  DependsOn: SampleDataSet
SampleProfileJob:
  Type: AWS::DataBrew::Job
  Properties:
    Name: SampleProfileJob
    Type: PROFILE
    RoleArn: !GetAtt GenericDataBrewDataQualityRole.Arn
    DatasetName: SampleDataSet
    Timeout: 5
    ValidationConfigurations:
      - RulesetArn: !GetAtt SampleRuleSet.Arn
    OutputLocation:
      Bucket: *****
  DependsOn: SampleRuleSet
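One possible workaround, offered as an untested sketch: since !GetAtt does not expose an Arn for these resources, the ARN can be assembled with !Sub. This assumes that Ref on a DataBrew dataset or ruleset returns the resource name, and that DataBrew ARNs follow the pattern arn:partition:databrew:region:account:dataset/name (and ruleset/name); please check both against the current documentation before relying on it.

```yaml
SampleRuleSet:
  Type: AWS::DataBrew::Ruleset
  Properties:
    Name: SampleRuleSet
    Rules:
      - Name: rule1
        Disabled: true
        CheckExpression: "AGG(DUPLICATE_ROWS_COUNT) <= :val1"
        SubstitutionMap:
          - Value: "0"
            ValueReference: ":val1"
    # Assumption: ${SampleDataSet} resolves to the dataset name via Ref
    TargetArn: !Sub "arn:${AWS::Partition}:databrew:${AWS::Region}:${AWS::AccountId}:dataset/${SampleDataSet}"
SampleProfileJob:
  Type: AWS::DataBrew::Job
  Properties:
    # Remaining job properties as in the question's template
    ValidationConfigurations:
      # Assumption: ${SampleRuleSet} resolves to the ruleset name via Ref
      - RulesetArn: !Sub "arn:${AWS::Partition}:databrew:${AWS::Region}:${AWS::AccountId}:ruleset/${SampleRuleSet}"
```

Because ${SampleDataSet} and ${SampleRuleSet} inside !Sub are references to the other resources, CloudFormation should infer the creation order, so the explicit DependsOn entries would likely no longer be needed.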
I have a lot of resources of type AWS::Glue::Table in my AWS templates, and I do not want to copy-paste the same snippet from template to template. So the idea is to create a reusable nested stack that accepts parameters. I did that, but one problem remains: I do not know how to pass the columns to this stack via parameters. The value [{Type: string, Name: type}, {Type: string, Name: timeLogged}] is an array of objects, but parameters only accept string types.
I tried to do something like this:
!Split [ "," , "{Type: string, Name: type}, {Type: string, Name: timeLogged}"] - but it did not help.
AWSTemplateFormatVersion: 2010-09-09
Description: The AWS CloudFormation template for creating a Glue table
Parameters:
  DestinationBucketName:
    Type: String
    Description: Destination Regional Bucket Name
  DestinationBucketPrefix:
    Type: String
    Description: Destination Regional Bucket Prefix
  DatabaseName:
    Type: String
    Description: Database for Kinesis Analytics
  TableName:
    Type: String
    Description: Table for Kinesis Analytics
  InputFormat:
    Type: String
    Description: Input format for data
  OutputFormat:
    Type: String
    Description: Output format for data
  SerializationLibrary:
    Type: String
    Description: Serialization library for converting data
Resources:
  LogsCollectionTable:
    Type: AWS::Glue::Table
    Properties:
      DatabaseName: !Ref DatabaseName
      CatalogId: !Ref AWS::AccountId
      TableInput:
        Name: !Ref TableName
        Description: Table for storing data
        TableType: EXTERNAL_TABLE
        StorageDescriptor:
          Columns: [{Type: string, Name: type}, {Type: string, Name: timeLogged}]
          Location: !Sub s3://${DestinationBucketName}/${DestinationBucketPrefix}
          InputFormat: !Ref InputFormat
          OutputFormat: !Ref OutputFormat
          SerdeInfo:
            SerializationLibrary: !Ref SerializationLibrary
Short answer: you currently cannot pass an array of objects as a parameter. You would need to pass every value as its own parameter.
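To illustrate what passing every value manually could look like, here is a rough sketch of the nested stack with one name/type parameter pair per column. The Column1Name, Column1Type, Column2Name, and Column2Type parameters are hypothetical names I made up for this example, and the location, format, and SerDe properties from the question are omitted to keep it short.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Sketch of a Glue table stack with one parameter pair per column
Parameters:
  DatabaseName:
    Type: String
  TableName:
    Type: String
  # Hypothetical parameters, not part of the original template
  Column1Name:
    Type: String
  Column1Type:
    Type: String
    Default: string
  Column2Name:
    Type: String
  Column2Type:
    Type: String
    Default: string
Resources:
  LogsCollectionTable:
    Type: AWS::Glue::Table
    Properties:
      DatabaseName: !Ref DatabaseName
      CatalogId: !Ref AWS::AccountId
      TableInput:
        Name: !Ref TableName
        TableType: EXTERNAL_TABLE
        StorageDescriptor:
          Columns:
            # Each column is assembled from its own pair of string parameters
            - Name: !Ref Column1Name
              Type: !Ref Column1Type
            - Name: !Ref Column2Name
              Type: !Ref Column2Type
```

The obvious drawback is that the number of columns is fixed by the template, so tables with different column counts would need different nested templates.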
I am trying to create a database in Glue using CloudFormation, but it fails with the error below. Am I missing something?
Property validation failure: [The property {/DatabaseInput} is required, The property {/CatalogId} is required]
This is what my template code block looks like:
GlueDatabase:
  Type: AWS::Glue::Database
  Properties:
    CatalogId: !Ref AWS::AccountId
    DatabaseInput: !Ref TeamName
According to the docs the DatabaseInput should have the following structure:
GlueDatabase:
  Type: AWS::Glue::Database
  Properties:
    CatalogId: !Ref AWS::AccountId
    DatabaseInput:
      Description: String
      LocationUri: String
      Name: String
      Parameters: Json
Thus the question is: what is TeamName in your template?
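If TeamName turns out to be a plain String parameter (an assumption on my part, since the question does not show it), a minimal sketch of the fix would be to pass it as the Name inside the DatabaseInput object instead of as the DatabaseInput itself:

```yaml
Parameters:
  # Assumed to be a String parameter; the question does not show its definition
  TeamName:
    Type: String
Resources:
  GlueDatabase:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        # DatabaseInput is an object; the string goes into its Name property
        Name: !Ref TeamName
```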
I'm trying to set up AWS Glue to read from a RDS Postgres using CloudFormation. In order to do that I need to create a crawler using the JdbcTarget option. (Or do I not?)
Records:
  Type: 'AWS::Glue::Crawler'
  Properties:
    DatabaseName: transact
    Targets:
      JdbcTargets:
        - Path: "jdbc:postgresql://host:5432/database"
    Role: !Ref ETLAgent
But creating the stack in CloudFormation fails with:
CREATE_FAILED | AWS::Glue::Crawler | Records | Connection name cannot be equal to null or empty. (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException;
Even though the docs say:
ConnectionName
The name of the connection to use for the JDBC target.
Required: No
What is the correct AWS Glue setup using CloudFormation that will allow me to read from RDS?
You're missing the ConnectionName property, which should carry the name of a connection resource, and that resource is missing from your template as well. The Path property you're setting is used to select the schemas/tables to crawl (dbname/%/% to include all). Consult the CloudFormation docs on the Crawler JdbcTarget for details.
Your template should look something like this:
MyDbConnection:
  Type: "AWS::Glue::Connection"
  Properties:
    CatalogId: !Ref 'AWS::AccountId'
    ConnectionInput:
      Description: "JDBC Connection to my RDS DB"
      PhysicalConnectionRequirements:
        AvailabilityZone: "eu-central-1a"
        SecurityGroupIdList:
          - my-sec-group-id
        SubnetId: my-subnet-id
      ConnectionType: "JDBC"
      ConnectionProperties:
        "JDBC_CONNECTION_URL": "jdbc:postgresql://host:5432/database"
        "USERNAME": "my-db-username"
        "PASSWORD": "my-password"
Records:
  Type: 'AWS::Glue::Crawler'
  Properties:
    DatabaseName: transact
    Targets:
      JdbcTargets:
        - ConnectionName: !Ref MyDbConnection
          Path: "database/%/%"
    Role: !Ref ETLAgent
From the documentation, AWS::Athena::NamedQuery, it is unclear how to attach Athena to an S3 bucket specified in the same stack.
If I had to guess from the example, I would imagine that you can write a template like,
Resources:
  MyS3Bucket:
    Type: AWS::S3::Bucket
    ... other params ...
  AthenaNamedQuery:
    Type: AWS::Athena::NamedQuery
    Properties:
      Database: "db_name"
      Name: "MostExpensiveWorkflow"
      QueryString: >
        CREATE EXTERNAL TABLE db_name.test_table
        (...) LOCATION s3://.../path/to/folder/
Would a template like the above work? Upon stack creation, will the table db_name.test_table be available to run queries on?
Turns out the way you connect the S3 and Athena is to make a Glue table! How silly of me!! Of course Glue is how you connect things!
Sarcasm aside, this is a template that worked for me when using AWS::Glue::Table and AWS::Glue::Database,
Resources:
  MyS3Bucket:
    Type: AWS::S3::Bucket
  MyGlueDatabase:
    Type: AWS::Glue::Database
    Properties:
      DatabaseInput:
        Name: my-glue-database
        Description: "Glue beats tape"
      CatalogId: !Ref AWS::AccountId
  MyGlueTable:
    Type: AWS::Glue::Table
    Properties:
      DatabaseName: !Ref MyGlueDatabase
      CatalogId: !Ref AWS::AccountId
      TableInput:
        Name: my-glue-table
        Parameters: { "classification" : "csv" }
        StorageDescriptor:
          Location:
            Fn::Sub: "s3://${MyS3Bucket}/"
          InputFormat: "org.apache.hadoop.mapred.TextInputFormat"
          OutputFormat: "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
          SerdeInfo:
            Parameters: { "separatorChar" : "," }
            SerializationLibrary: "org.apache.hadoop.hive.serde2.OpenCSVSerde"
          StoredAsSubDirectories: false
          Columns:
            - Name: column0
              Type: string
            - Name: column1
              Type: string
After this, the database and table were in the AWS Athena Console!