Invalid Schema error in AWS Glue created via Terraform

I have a Kinesis Firehose configuration in Terraform that reads JSON data from a Kinesis stream, converts it to Parquet using Glue, and writes it to S3.
Something is wrong with the data format conversion, and I am getting the error below (with some details removed):
{"attemptsMade":1,"arrivalTimestamp":1624541721545,"lastErrorCode":"DataFormatConversion.InvalidSchema","lastErrorMessage":"The schema is invalid. The specified table has no columns.","attemptEndingTimestamp":1624542026951,"rawData":"xx","sequenceNumber":"xx","subSequenceNumber":null,"dataCatalogTable":{"catalogId":null,"databaseName":"db_name","tableName":"table_name","region":null,"versionId":"LATEST","roleArn":"xx"}}
The Terraform configuration I am using for the Glue table is as follows:
resource "aws_glue_catalog_table" "stream_format_conversion_table" {
  name          = "${var.resource_prefix}-parquet-conversion-table"
  database_name = aws_glue_catalog_database.stream_format_conversion_db.name
  table_type    = "EXTERNAL_TABLE"
  parameters = {
    EXTERNAL              = "TRUE"
    "parquet.compression" = "SNAPPY"
  }
  storage_descriptor {
    location      = "s3://${element(split(":", var.bucket_arn), 5)}/"
    input_format  = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"
    ser_de_info {
      name                  = "my-stream"
      serialization_library = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
      parameters = {
        "serialization.format" = 1
      }
    }
    columns {
      name = "metadata"
      type = "struct<tenantId:string,env:string,eventType:string,eventTimeStamp:timestamp>"
    }
    columns {
      name = "eventpayload"
      type = "struct<operation:string,timestamp:timestamp,user_name:string,user_id:int,user_email:string,batch_id:string,initiator_id:string,initiator_email:string,payload:string>"
    }
  }
}
What needs to change here?

I hit the "The schema is invalid. The specified table has no columns" error with the following combination:
an Avro schema in the Glue Schema Registry,
a Glue table created through the console using "Add table from existing schema",
Kinesis Data Firehose configured with Parquet conversion and referencing the Glue table created from the schema registry.
It turns out that Kinesis Data Firehose (KDF) is unable to read the table's schema if the table was created from an existing schema. The table has to be created from scratch (as opposed to "Add table from existing schema"). This isn't documented ... for now.

In addition to the answer from mberchon I found that the default generated policy for the Kinesis Delivery Stream did not include the necessary IAM permissions to actually read the schema.
I had to manually modify the IAM policy to include glue:GetSchema and glue:GetSchemaVersion.
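The permission change described above could be expressed in Terraform roughly as follows. This is a minimal sketch, not the policy Firehose generates: the resource name firehose_glue_schema_read and the policy name are hypothetical, aws_iam_role.firehose_role is assumed to be the delivery stream's role (the name is borrowed from another answer in this thread), and Resource = "*" is a placeholder that should be narrowed to the relevant registry and schema ARNs.

```hcl
# Hypothetical inline policy attached to the Firehose delivery role.
resource "aws_iam_role_policy" "firehose_glue_schema_read" {
  name = "firehose-glue-schema-read"   # hypothetical name
  role = aws_iam_role.firehose_role.id # assumed role resource

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        # The two actions named in the answer above.
        Action = [
          "glue:GetSchema",
          "glue:GetSchemaVersion",
        ]
        Resource = "*" # placeholder; restrict to your registry/schema ARNs
      }
    ]
  })
}
```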

Frustrated by having to define columns manually, I wrote a little Python tool that takes a pydantic class (it could be made to work with json-schema too) and generates a JSON file that can be used with Terraform to create the table.
https://github.com/nanit/j2g
from pydantic import BaseModel
from typing import List
class Bar(BaseModel):
    name: str
    age: int
class Foo(BaseModel):
    nums: List[int]
    bars: List[Bar]
    other: str
gets converted to
{
  "nums": "array<int>",
  "bars": "array<struct<name:string,age:int>>",
  "other": "string"
}
and can be used in Terraform like so
locals {
  columns = jsondecode(file("${path.module}/glue_schema.json"))
}
resource "aws_glue_catalog_table" "table" {
  name          = "table_name"
  database_name = "db_name"
  storage_descriptor {
    dynamic "columns" {
      for_each = local.columns
      content {
        name = columns.key
        type = columns.value
      }
    }
  }
}

Thought I'd post here, as I was facing the same problem and found a workaround that appears to work.
As stated above, AWS does not allow you to use tables generated from an existing schema for data format conversion in Firehose. That said, if you are using Terraform, you can create one table from the existing schema, use the columns attribute of that first table to create a second table, and then use the second table for data format conversion in the Firehose config. I can confirm this works.
Terraform for the tables:
resource "aws_glue_catalog_table" "aws_glue_catalog_table_from_schema" {
  name          = "first_table"
  database_name = "foo"
  storage_descriptor {
    schema_reference {
      schema_id {
        schema_arn = aws_glue_schema.your_glue_schema.arn
      }
      schema_version_number = aws_glue_schema.your_glue_schema.latest_schema_version
    }
  }
}
resource "aws_glue_catalog_table" "aws_glue_catalog_table_from_first_table" {
  name          = "second_table"
  database_name = "foo"
  storage_descriptor {
    dynamic "columns" {
      for_each = aws_glue_catalog_table.aws_glue_catalog_table_from_schema.storage_descriptor[0].columns
      content {
        name = columns.value.name
        type = columns.value.type
      }
    }
  }
}
Firehose data format conversion configuration:
data_format_conversion_configuration {
  output_format_configuration {
    serializer {
      parquet_ser_de {}
    }
  }
  input_format_configuration {
    deserializer {
      hive_json_ser_de {}
    }
  }
  schema_configuration {
    database_name = aws_glue_catalog_table.aws_glue_catalog_table_from_first_table.database_name
    role_arn      = aws_iam_role.firehose_role.arn
    table_name    = aws_glue_catalog_table.aws_glue_catalog_table_from_first_table.name
  }
}

Terraform MalformedXML: The XML you provided was not well-formed for aws_s3_bucket_lifecycle_configuration

I got really stuck today on the following error:
MalformedXML: The XML you provided was not well-formed
when applying aws_s3_bucket_lifecycle_configuration via Terraform using hashicorp/aws v4.38.0.
I wanted to set a rule that expires files larger than 0 bytes after 365 days under the my_prefix prefix, so the definition of the resource looks like this:
resource "aws_s3_bucket_lifecycle_configuration" "my-bucket-lifecycle-configuration" {
  depends_on = [aws_s3_bucket_versioning.my-bucket-versioning]
  bucket     = aws_s3_bucket.my_bucket.id
  rule {
    id = "my_prefix_current_version_config"
    filter {
      and {
        prefix                   = "my_prefix/"
        object_size_greater_than = 0
      }
    }
    expiration {
      days = 365
    }
    status = "Enabled"
  }
}
Does anyone have an idea what's wrong with the above definition?
Documentation: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_bucket_lifecycle_configuration
Remark: the following definition can be applied without problem (no and block):
resource "aws_s3_bucket_lifecycle_configuration" "my-bucket-lifecycle-configuration" {
  depends_on = [aws_s3_bucket_versioning.my-bucket-versioning]
  bucket     = aws_s3_bucket.my_bucket.id
  rule {
    id = "my_prefix_current_version_config"
    filter {
      prefix = "my_prefix/"
    }
    expiration {
      days = 365
    }
    status = "Enabled"
  }
}
From the documentation, you have to specify both the object size range (which I guess means you have to specify both object_size_greater_than and object_size_less_than) and the prefix, for example:
filter {
  and {
    prefix                   = "my_prefix/"
    object_size_greater_than = 0
    object_size_less_than    = 500
  }
}

How to skip declaring values in root module (for_each loop)

I am trying to build a reusable module that creates multiple S3 buckets. Based on a condition, some buckets have lifecycle rules and others do not. I am using a for expression in the for_each of the lifecycle rule resource and managed to get it working, but not 100%.
My var:
variable "bucket_details" {
  type = map(object({
    bucket_name      = string
    enable_lifecycle = bool
    glacier_ir_days  = number
    glacier_days     = number
  }))
}
How I go through the map on the lifecycle resource:
resource "aws_s3_bucket_lifecycle_configuration" "compliant_s3_bucket_lifecycle_rule" {
  for_each   = { for bucket, values in var.bucket_details : bucket => values if values.enable_lifecycle }
  depends_on = [aws_s3_bucket_versioning.compliant_s3_bucket_versioning]
  bucket     = aws_s3_bucket.compliant_s3_bucket[each.key].bucket
  rule {
    id     = "basic_config"
    status = "Enabled"
    abort_incomplete_multipart_upload {
      days_after_initiation = 7
    }
    transition {
      days          = each.value["glacier_ir_days"]
      storage_class = "GLACIER_IR"
    }
    transition {
      days          = each.value["glacier_days"]
      storage_class = "GLACIER"
    }
    expiration {
      days = 2555
    }
    noncurrent_version_transition {
      noncurrent_days = each.value["glacier_ir_days"]
      storage_class   = "GLACIER_IR"
    }
    noncurrent_version_transition {
      noncurrent_days = each.value["glacier_days"]
      storage_class   = "GLACIER"
    }
    noncurrent_version_expiration {
      noncurrent_days = 2555
    }
  }
}
How I WOULD love to reference it in the root module:
module "s3_buckets" {
  source = "./modules/aws-s3-compliance"
  #
  bucket_details = {
    "fisrtbucketname" = {
      bucket_name      = "onlythefisrtbuckettesting"
      enable_lifecycle = true
      glacier_ir_days  = 555
      glacier_days     = 888
    }
    "secondbuckdetname" = {
      bucket_name      = "onlythesecondbuckettesting"
      enable_lifecycle = false
    }
  }
}
So when I reference it like that, it cannot validate, because I am not setting values for both glacier_ir_days & glacier_days - understandable.
My question is - is there a way to check if the enable_lifecycle is set to false, to not expect values for these?
Currently, as a workaround, I am just setting zeroes for those and since the resource is not created if enable_lifecycle is false, it does not matter, but I would love it to be cleaner.
Thank you in advance.
The forthcoming Terraform v1.3 release will include a new feature for declaring optional attributes in an object type constraint, with the option of declaring a default value to use when the attribute isn't set.
At the time I'm writing this the v1.3 release is still under development and so not available for general use, but I'm going to answer this with an example that should work with Terraform v1.3 once it's released. If you wish to try it in the meantime you can experiment with the most recent v1.3 alpha release which includes this feature, though of course I would not recommend using it in production until it's in a final release.
It seems that your glacier_ir_days and glacier_days attributes are, from a modeling perspective, attributes that are required when the lifecycle is enabled and not required when the lifecycle is disabled.
I would suggest modelling that by placing these attributes in a nested object called lifecycle and implementing it such that the lifecycle resource is enabled when that attribute is set, and disabled when it is left unset.
The declaration would therefore look like this:
variable "bucket_details" {
  type = map(object({
    bucket_name = string
    lifecycle = optional(object({
      glacier_ir_days = number
      glacier_days    = number
    }))
  }))
}
When an attribute is marked as optional(...) like this, Terraform will allow omitting it in the calling module block and then will quietly set the attribute to null when it performs the type conversion to make the given value match the type constraint. This particular declaration doesn't have a default value, but it's also possible to pass a second argument in the optional(...) syntax which Terraform will then use instead of null as the placeholder value when the attribute isn't specified.
The calling module block would therefore look like this:
module "s3_buckets" {
  source = "./modules/aws-s3-compliance"
  #
  bucket_details = {
    "fisrtbucketname" = {
      bucket_name = "onlythefisrtbuckettesting"
      lifecycle = {
        glacier_ir_days = 555
        glacier_days    = 888
      }
    }
    "secondbuckdetname" = {
      bucket_name = "onlythesecondbuckettesting"
    }
  }
}
Your resource block inside the module will remain similar to what you showed, but the if clause of the for expression will test if the lifecycle object is non-null instead:
resource "aws_s3_bucket_lifecycle_configuration" "compliant_s3_bucket_lifecycle_rule" {
  for_each = {
    for bucket, values in var.bucket_details : bucket => values
    if values.lifecycle != null
  }
  # ...
}
Finally, the references to the attributes would be slightly different to traverse through the lifecycle object:
transition {
  days          = each.value.lifecycle.glacier_days
  storage_class = "GLACIER"
}
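For the default-value form of optional(...) mentioned above, a minimal sketch might look like the following. It assumes Terraform v1.3 or later, and the default of 90 days for glacier_ir_days is an arbitrary illustrative value, not something taken from the question.

```hcl
variable "bucket_details" {
  type = map(object({
    bucket_name = string
    lifecycle = optional(object({
      # The second argument to optional() is the value used when the
      # attribute is omitted (90 is an illustrative assumption).
      glacier_ir_days = optional(number, 90)
      glacier_days    = number
    }))
  }))
}
```

With a declaration like this, a caller could write lifecycle = { glacier_days = 888 } and glacier_ir_days would be filled in as 90. A bucket that omits lifecycle entirely would still get null for the whole object, so the values.lifecycle != null test keeps working.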

Value for Terraform Composer airflow_config_override secrets-backend_kwargs

I need to change, using Terraform, the default project_id in my Composer environment so that I can access secrets from another project. To do so, according to Terraform, I need the variable airflow_config_overrides. I guess I should have something like this:
resource "google_composer_environment" "test" {
  # ...
  config {
    software_config {
      airflow_config_overrides = {
        secrets-backend        = "airflow.providers.google.cloud.secrets.secret_manager.CloudSecretManagerBackend",
        secrets-backend_kwargs = {"project_id":"9999999999999"}
      }
    }
  }
}
The secrets-backend section-key seems to be working. On the other hand, secrets-backend_kwargs is returning the following error:
Inappropriate value for attribute "airflow_config_overrides": element "secrets-backend_kwargs": string required
It seems that the problem is that GCP expects a JSON format and Terraform requires a string. How can I get Terraform to provide it in the format needed?
You can convert a map such as {"project_id":"9999999999999"} into a JSON encoded string by using the jsonencode function.
So merging the example given in the google_composer_environment resource documentation with your config in the question you can do something like this:
resource "google_composer_environment" "test" {
  name   = "mycomposer"
  region = "us-central1"
  config {
    software_config {
      airflow_config_overrides = {
        secrets-backend        = "airflow.providers.google.cloud.secrets.secret_manager.CloudSecretManagerBackend",
        secrets-backend_kwargs = jsonencode({"project_id":"9999999999999"})
      }
      pypi_packages = {
        numpy = ""
        scipy = "==1.1.0"
      }
      env_variables = {
        FOO = "bar"
      }
    }
  }
}

Make a list from data ec2_instance/'s?

I have a list of servers stored as a list in locals as
locals {
  my_list = [
    "server1",
    "server2",
    "server3",
    "server4"
  ]
}
Can I fetch data for each server, such as the instance ID, using the locals above, without defining an individual data block for each server?
Can I then put those attributes in a list? Finally, how would I consume it later? The example below is for just one server (it is a CloudWatch alarm dimension):
dimensions = {
  instanceid = data.aws_instance.server1.instance_id
}
If the entries in my_list are instance IDs (that is, assuming server1 is an instance ID), you can pass the list to an instance-id filter:
data "aws_instances" "my_instances" {
  filter {
    name   = "instance-id"
    values = local.my_list
  }
}
In case my_list contains instance names, then you can use:
data "aws_instance" "my_instances" {
  for_each = toset(local.my_list)
  instance_tags = {
    Name = each.key
  }
}
and to get the list of instance ids:
values(data.aws_instance.my_instances)[*].id
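To consume the per-name lookup in CloudWatch alarms, one option is to chain for_each from the data source so that each instance gets its own alarm. The following is a sketch only: the alarm name pattern, metric, threshold, and periods are illustrative assumptions, and it relies on the data "aws_instance" "my_instances" block with for_each shown above.

```hcl
resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  # One alarm per instance returned by the data source above.
  for_each = data.aws_instance.my_instances

  alarm_name          = "cpu-high-${each.key}" # hypothetical naming scheme
  namespace           = "AWS/EC2"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 80  # illustrative value
  period              = 300 # seconds, illustrative
  evaluation_periods  = 2   # illustrative

  dimensions = {
    # each.value is the matching aws_instance data object.
    InstanceId = each.value.id
  }
}
```

Here each.key is the server name from my_list and each.value.id is the instance ID, so no separate list of IDs is needed for this use case.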

Athena querying fails while calling Glue virtual view with invalid JSON error

I am trying to create a Glue virtual view (table_type = "VIRTUAL_VIEW") that I can query through Athena, and I am doing this with Terraform. I can create the Glue view successfully using the Terraform code below, but querying it fails with the error shown below (INVALID_VIEW: Invalid view JSON: SELECT id, module, projectid FROM test_data."test_metrics" limit 10). I am not sure what I am doing wrong. I guess it has something to do with the way I am passing the query to the view; I have tried a few ways, but in vain. Any help would be greatly appreciated :). Thank you.
locals {
  database = "test_data"
  query    = "SELECT id, \nmodule, \nprojectid FROM ${local.database}.\"test_metrics\" limit 10"
}
resource "aws_glue_catalog_table" "story_view" {
  database_name      = local.database
  name               = "story_flow_details"
  table_type         = "VIRTUAL_VIEW"
  view_original_text = "/* Presto View: ${base64encode(local.query)} */"
  view_expanded_text = "/* Presto View */"
  parameters = {
    presto_view = "true"
    comment     = "Presto View"
  }
  storage_descriptor {
    ser_de_info {
      name                  = "JsonHiveSerDe1"
      serialization_library = "org.apache.hive.hcatalog.data.JsonSerDe"
    }
    columns {
      name = "id"
      type = "string"
    }
    columns {
      name = "module"
      type = "string"
    }
    columns {
      name    = "projectid"
      type    = "string"
      comment = ""
    }
  }
}
Your query has the following error(s):
INVALID_VIEW: Invalid view JSON: SELECT id, module, projectid FROM test_data."test_metrics" limit 10;
This query ran against the "test_data" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 79231ddc-d77c-4660-bf70-b84009a82082.
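The error says the decoded view text is not valid JSON, which suggests Athena expects the base64 payload to be a JSON document describing the view rather than the raw SQL. A hedged sketch under that assumption follows. The field names (originalSql, catalog, schema, columns) and the Presto type varchar reflect the layout commonly reported for Athena views; they do not come from this thread, so treat them as assumptions to verify against a view created in the Athena console.

```hcl
locals {
  database = "test_data"
  query    = "SELECT id, module, projectid FROM ${local.database}.\"test_metrics\" limit 10"

  # Assumed layout of the view definition that Athena decodes.
  presto_view = jsonencode({
    originalSql = local.query
    catalog     = "awsdatacatalog"
    schema      = local.database
    columns = [
      { name = "id", type = "varchar" },
      { name = "module", type = "varchar" },
      { name = "projectid", type = "varchar" },
    ]
  })
}

resource "aws_glue_catalog_table" "story_view" {
  database_name = local.database
  name          = "story_flow_details"
  table_type    = "VIRTUAL_VIEW"
  # Encode the JSON document, not the bare SQL string.
  view_original_text = "/* Presto View: ${base64encode(local.presto_view)} */"
  view_expanded_text = "/* Presto View */"
  parameters = {
    presto_view = "true"
    comment     = "Presto View"
  }
  storage_descriptor {
    columns {
      name = "id"
      type = "string"
    }
    columns {
      name = "module"
      type = "string"
    }
    columns {
      name = "projectid"
      type = "string"
    }
  }
}
```

The column list in the JSON uses Presto types while the storage_descriptor columns use Hive types; under this assumption the two lists need to describe the same columns in the same order.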