HIVE_METASTORE_ERROR expected 'STRING' but 'STRING' is found - amazon-athena

I've been unable to get any query to work against my AWS Glue Partitioned table. The error I'm getting is
HIVE_METASTORE_ERROR: com.facebook.presto.spi.PrestoException: Error:
type expected at the position 0 of 'STRING' but 'STRING' is found.
(Service: null; Status Code: 0; Error Code: null; Request ID: null)
I've found one other thread that brings up the fact that the database and table names cannot contain characters other than alphanumerics and underscores. So I made sure the database name, table name and all column names adhere to this restriction. The only object that does not adhere to it is my S3 bucket name, which would be very difficult to change.
Here are the table definitions and parquet-tools dumps of the data.
AWS Glue Table Definition
{
"Table": {
"UpdateTime": 1545845064.0,
"PartitionKeys": [
{
"Comment": "call_time year",
"Type": "INT",
"Name": "date_year"
},
{
"Comment": "call_time month",
"Type": "INT",
"Name": "date_month"
},
{
"Comment": "call_time day",
"Type": "INT",
"Name": "date_day"
}
],
"StorageDescriptor": {
"OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
"SortColumns": [],
"InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
"SerdeInfo": {
"SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
"Name": "ser_de_info_system_admin_created",
"Parameters": {
"serialization.format": "1"
}
},
"BucketColumns": [],
"Parameters": {},
"Location": "s3://ph-data-lake-cududfs2z3xveg5t/curated/system/admin_created/",
"NumberOfBuckets": 0,
"StoredAsSubDirectories": false,
"Columns": [
{
"Comment": "Unique user ID",
"Type": "STRING",
"Name": "user_id"
},
{
"Comment": "Unique group ID",
"Type": "STRING",
"Name": "group_id"
},
{
"Comment": "Date and time the message was published",
"Type": "TIMESTAMP",
"Name": "call_time"
},
{
"Comment": "call_time year",
"Type": "INT",
"Name": "date_year"
},
{
"Comment": "call_time month",
"Type": "INT",
"Name": "date_month"
},
{
"Comment": "call_time day",
"Type": "INT",
"Name": "date_day"
},
{
"Comment": "Given name for user",
"Type": "STRING",
"Name": "given_name"
},
{
"Comment": "IANA time zone for user",
"Type": "STRING",
"Name": "time_zone"
},
{
"Comment": "Name that links to geneaology",
"Type": "STRING",
"Name": "family_name"
},
{
"Comment": "Email address for user",
"Type": "STRING",
"Name": "email"
},
{
"Comment": "RFC BCP 47 code set in this user's profile language and region",
"Type": "STRING",
"Name": "language"
},
{
"Comment": "Phone number including ITU-T ITU-T E.164 country codes",
"Type": "STRING",
"Name": "phone"
},
{
"Comment": "Date user was created",
"Type": "TIMESTAMP",
"Name": "date_created"
},
{
"Comment": "User role",
"Type": "STRING",
"Name": "role"
},
{
"Comment": "Provider dashboard preferences",
"Type": "STRUCT<portal_welcome_done:BOOLEAN,weekend_digests:BOOLEAN,patients_hidden:BOOLEAN,last_announcement:STRING>",
"Name": "preferences"
},
{
"Comment": "Provider notification settings",
"Type": "STRUCT<digest_email:BOOLEAN>",
"Name": "notifications"
}
],
"Compressed": true
},
"Parameters": {
"classification": "parquet",
"parquet.compress": "SNAPPY"
},
"Description": "System wide admin_created messages",
"Name": "system_admin_created",
"TableType": "EXTERNAL_TABLE",
"Retention": 0
}
}
AWS Athena schema
CREATE EXTERNAL TABLE `system_admin_created`(
`user_id` STRING COMMENT 'Unique user ID',
`group_id` STRING COMMENT 'Unique group ID',
`call_time` TIMESTAMP COMMENT 'Date and time the message was published',
`date_year` INT COMMENT 'call_time year',
`date_month` INT COMMENT 'call_time month',
`date_day` INT COMMENT 'call_time day',
`given_name` STRING COMMENT 'Given name for user',
`time_zone` STRING COMMENT 'IANA time zone for user',
`family_name` STRING COMMENT 'Name that links to geneaology',
`email` STRING COMMENT 'Email address for user',
`language` STRING COMMENT 'RFC BCP 47 code set in this user\'s profile language and region',
`phone` STRING COMMENT 'Phone number including ITU-T ITU-T E.164 country codes',
`date_created` TIMESTAMP COMMENT 'Date user was created',
`role` STRING COMMENT 'User role',
`preferences` STRUCT<portal_welcome_done:BOOLEAN,weekend_digests:BOOLEAN,patients_hidden:BOOLEAN,last_announcement:STRING> COMMENT 'Provider dashboard preferences',
`notifications` STRUCT<digest_email:BOOLEAN> COMMENT 'Provider notification settings')
PARTITIONED BY (
`date_year` INT COMMENT 'call_time year',
`date_month` INT COMMENT 'call_time month',
`date_day` INT COMMENT 'call_time day')
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://ph-data-lake-cududfs2z3xveg5t/curated/system/admin_created/'
TBLPROPERTIES (
'classification'='parquet',
'parquet.compress'='SNAPPY')
parquet-tools cat
role = admin
date_created = 2018-01-11T14:40:23.142Z
preferences:
.patients_hidden = false
.weekend_digests = true
.portal_welcome_done = true
email = foo.barr+123@example.com
notifications:
.digest_email = true
group_id = 5a5399df23a804001aa25227
given_name = foo
call_time = 2018-01-11T14:40:23.000Z
time_zone = US/Pacific
family_name = bar
language = en-US
user_id = 5a5777572060a700170240c3
parquet-tools schema
message spark_schema {
optional binary role (UTF8);
optional binary date_created (UTF8);
optional group preferences {
optional boolean patients_hidden;
optional boolean weekend_digests;
optional boolean portal_welcome_done;
optional binary last_announcement (UTF8);
}
optional binary email (UTF8);
optional group notifications {
optional boolean digest_email;
}
optional binary group_id (UTF8);
optional binary given_name (UTF8);
optional binary call_time (UTF8);
optional binary time_zone (UTF8);
optional binary family_name (UTF8);
optional binary language (UTF8);
optional binary user_id (UTF8);
optional binary phone (UTF8);
}

I ran into a similar PrestoException and the cause was using uppercase letters for the column type. Once I changed 'VARCHAR(10)' to 'varchar(10)', it worked.
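For reference, a minimal sketch of what that looks like when the columns are defined through boto3, reusing a few of the columns from the table above; based on that answer, the point is simply to use lowercase Hive type names:

# Sketch only (not the poster's actual call): the same columns as above,
# but with lowercase Hive type names. Uppercase variants such as 'STRING'
# or 'VARCHAR(10)' can trigger the HIVE_METASTORE_ERROR shown above.
columns = [
    {"Name": "user_id", "Type": "string", "Comment": "Unique user ID"},
    {"Name": "call_time", "Type": "timestamp",
     "Comment": "Date and time the message was published"},
    {"Name": "preferences",
     "Type": "struct<portal_welcome_done:boolean,weekend_digests:boolean,"
             "patients_hidden:boolean,last_announcement:string>",
     "Comment": "Provider dashboard preferences"},
]
# These dicts can then be passed as StorageDescriptor["Columns"] in
# glue.create_table / update_table (boto3).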

I was declaring the partition keys as fields in the table. I also ran into the Parquet vs. Hive difference in TIMESTAMP handling and switched those columns to ISO 8601 strings. From there I pretty much gave up, because Athena throws a schema error if all Parquet files in the S3 bucket do not have the same schema as the Athena table. However, with optional fields and sparse columns this is practically guaranteed to happen.
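If you go the ISO 8601 route, a rough sketch of the conversion before writing the Parquet files, assuming pandas with pyarrow and timestamps already in UTC (the file name is arbitrary):

import pandas as pd

# Hypothetical row; call_time starts out as a pandas UTC datetime column.
df = pd.DataFrame({
    "user_id": ["5a5777572060a700170240c3"],
    "call_time": pd.to_datetime(["2018-01-11T14:40:23.000Z"], utc=True),
})

# Store the timestamp as an ISO 8601 string so the Parquet column becomes a
# plain UTF8 binary, matching a string column on the Athena/Glue side.
df["call_time"] = df["call_time"].dt.strftime("%Y-%m-%dT%H:%M:%S.%f").str[:-3] + "Z"

df.to_parquet("admin_created.snappy.parquet", compression="snappy")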

I also ran into this error, and of course the error message ended up telling me nothing about the actual problem. I had the exact same error as the original poster.
I am creating my Glue tables via the Python boto3 API and feeding it the column names, types, partition columns, and some other things. The problem:
Here is the code I was using to create the table:
import boto3

glue_clt = boto3.client("glue", region_name="us-east-1")
glue_clt.create_table(
    DatabaseName=database,
    TableInput={
        "Name": table,
        "StorageDescriptor": {
            "Columns": table_cols,
            "Location": table_location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
            }
        },
        "PartitionKeys": partition_cols,
        "TableType": "EXTERNAL_TABLE"
    }
)
So I ended up defining all the column names and types in the Columns input to the API, and I was also giving it the partition column names and types in the PartitionKeys input. When I browsed to the AWS console, I realized that because I had defined the partition columns in both Columns and PartitionKeys, they were defined twice in the table.
Interestingly enough, if you try to do this via the console, it will throw a more descriptive error letting you know that the column already exists (if you try adding a partition column that already exists in the table).
TO SOLVE:
I removed the partition columns and their types from the Columns input and instead fed them only through the PartitionKeys input, so they wouldn't be put on the table twice. So frustrating that this ultimately caused the same error message as the OP's when querying through Athena.
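For anyone hitting the same thing, here is roughly what the fixed call looks like; a trimmed sketch rather than my real code, with hypothetical database and bucket names and the Parquet input/output formats taken from the OP's table definition. The important part is that the partition columns appear only in PartitionKeys:

import boto3

glue_clt = boto3.client("glue", region_name="us-east-1")

# Non-partition columns only; date_year/date_month/date_day are NOT listed here.
table_cols = [
    {"Name": "user_id", "Type": "string"},
    {"Name": "group_id", "Type": "string"},
]

# Partition columns are declared here, and only here.
partition_cols = [
    {"Name": "date_year", "Type": "int"},
    {"Name": "date_month", "Type": "int"},
    {"Name": "date_day", "Type": "int"},
]

glue_clt.create_table(
    DatabaseName="my_database",  # hypothetical
    TableInput={
        "Name": "system_admin_created",
        "StorageDescriptor": {
            "Columns": table_cols,
            "Location": "s3://my-bucket/curated/system/admin_created/",  # hypothetical
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
            },
        },
        "PartitionKeys": partition_cols,
        "TableType": "EXTERNAL_TABLE",
    },
)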

This could also be related to how you created your database (through CloudFormation, the console, or the CLI), or to whether you have any forbidden characters like '-'. We have hyphens in our database and table names and it renders much of the functionality useless.

Related

DMS replication to Kinesis omit certain fields

We have a use case where we have enabled an AWS DMS replication task that streams changes from our Aurora Postgres cluster to a Kinesis Data Stream. The replication task is working as expected, but the JSON it sends to the Kinesis Data Stream contains fields like metadata that we don't care about and would ideally like to omit. Is there a way to do this without triggering a Lambda on KDS to remove the unwanted fields from the JSON?
I was looking at using the table mappings config of the DMS task when KDS is the target; documentation here - https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.Kinesis.html. The docs don't mention anything of this sort. Maybe I am missing something.
The current table mapping for my use case is as follows -
{
"rules": [
{
"rule-type": "selection",
"rule-id": "1",
"rule-name": "1",
"rule-action": "include",
"object-locator": {
"schema-name": "public",
"table-name": "%"
}
},
{
"rule-type": "object-mapping",
"rule-id": "2",
"rule-name": "DefaultMapToKinesis",
"rule-action": "map-record-to-record",
"object-locator": {
"schema-name": "public",
"table-name": "testing"
}
}
]
}
The table testing only has two columns, namely id and value, of type varchar and decimal respectively.
The result I am getting in KDS is as follows -
{
"data": {
"id": "5",
"value": 1111.22
},
"metadata": {
"timestamp": "2022-08-23T09:32:34.222745Z",
"record-type": "data",
"operation": "insert",
"partition-key-type": "schema-table",
"schema-name": "public",
"table-name": "testing",
"transaction-id": 145524
}
}
As seen above we are only interested in the data key of the json.
Is there any way in DMS config or KDS to filter on the data portion of the json sent by DMS without involving any new infra like Lambda?

AWS DMS CDC - Only capture changed values not entire record? (Source RDS MySQL)

I have a DMS CDC (change data capture) task set up from a MySQL database to stream to a Kinesis stream, which a Lambda is connected to.
I was hoping to ultimately receive only the value that has changed, and not an entire dump of the row; this way I know which column is being changed (at the moment it's impossible to decipher this without setting up another system to track changes myself).
Example, with the following mapping rule:
{
"rule-type": "selection",
"rule-id": "1",
"rule-name": "1",
"object-locator": {
"schema-name": "my-schema",
"table-name": "product"
},
"rule-action": "include",
"filters": []
},
and if I changed the name property of a record on the product table, I would hope to receive a record like this:
{
"data": {
"name": "newValue"
},
"metadata": {
"timestamp": "2021-07-26T06:47:15.762584Z",
"record-type": "data",
"operation": "update",
"partition-key-type": "schema-table",
"schema-name": "my-schema",
"table-name": "product",
"transaction-id": 8633730840
}
}
However, what I actually receive is something like this:
{
"data": {
"name": "newValue",
"id": "unchangedId",
"quantity": "unchangedQuantity",
"otherProperty": "unchangedValue"
},
"metadata": {
"timestamp": "2021-07-26T06:47:15.762584Z",
"record-type": "data",
"operation": "update",
"partition-key-type": "schema-table",
"schema-name": "my-schema",
"table-name": "product",
"transaction-id": 8633730840
}
}
As you can see, when receiving this it's impossible to decipher which property has changed without setting up additional systems to track it.
I've found another Stack Overflow thread where someone is posting an issue because their CDC is doing what I want mine to do. Can anyone point me in the right direction to achieve this?
I found the answer after digging into AWS documentation some more.
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.Kinesis.html#CHAP_Target.Kinesis.BeforeImage
Different source database engines provide different amounts of
information for a before image:
Oracle provides updates to columns only if they change.
PostgreSQL provides only data for columns that are part of the primary
key (changed or not).
MySQL generally provides data for all columns (changed or not).
I used the BeforeImageSettings on the task setting to include the original data with payloads.
"BeforeImageSettings": {
"EnableBeforeImage": true,
"FieldName": "before-image",
"ColumnFilter": "all"
}
While this still gives me the whole record, it gives me enough data to work out what's changed without additional systems.
{
"data": {
"name": "newValue",
"id": "unchangedId",
"quantity": "unchangedQuantity",
"otherProperty": "unchangedValue"
},
"before-image": {
"name": "oldValue",
"id": "unchangedId",
"quantity": "unchangedQuantity",
"otherProperty": "unchangedValue"
},
"metadata": {
"timestamp": "2021-07-26T06:47:15.762584Z",
"record-type": "data",
"operation": "update",
"partition-key-type": "schema-table",
"schema-name": "my-schema",
"table-name": "product",
"transaction-id": 8633730840
}
}
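With the before image in place, the changed columns can be worked out in the consumer itself. A minimal sketch of the idea for a Lambda reading from the Kinesis stream (the handler name and print output are just examples; the record layout matches the payload above):

import base64
import json


def handler(event, context):
    """Example Lambda consumer: report which columns changed per record."""
    for kinesis_record in event["Records"]:
        payload = json.loads(base64.b64decode(kinesis_record["kinesis"]["data"]))

        # Only updates carry a before image.
        if payload.get("metadata", {}).get("operation") != "update":
            continue

        after = payload.get("data", {})
        before = payload.get("before-image", {})

        changed = {
            column: {"old": before.get(column), "new": value}
            for column, value in after.items()
            if before.get(column) != value
        }
        print(json.dumps({"table": payload["metadata"]["table-name"],
                          "changed": changed}))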

Avro schema is not valid

I am trying to save this Avro schema. I get the message that the schema is not valid. Can someone share why it's not valid?
{
"type": "record",
"name": "Interactions",
"namespace": "com.amazonaws.personalize.schema",
"fields": [
{
"name": "InvoiceNo",
"type": "int"
},
{
"name": "StockCode",
"type": "int"
},
{
"name": "Description",
"type": "long"
},
{
"name": "Quantity",
"type": "string"
},
{
"name": "InvoiceDate",
"type": "string"
},
{
"name": "UnitPrice",
"type": "string"
},
{
"name": "CustomerID",
"type": "string"
},
{
"name": "CustomerID",
"type": "string"
},
{
"name": "Country",
"type": "string"
}
],
"version": "1.0"
}
I'm a bit late to the party here but I think your issue is twofold.
(1) You haven't reformatted your columns to use the field names that Personalize wants to see. The required fields for Interactions are USER_ID, ITEM_ID, and TIMESTAMP (with TIMESTAMP being in Unix epoch format). See reference here.
(2) The five specified fields for Interactions are USER_ID, ITEM_ID, TIMESTAMP, EVENT_TYPE, and EVENT_VALUE. If you do include more fields, they will be considered metadata fields, and you can only include up to 5 metadata fields. If you do include them AND the data type is "string", they must be specified as "categorical". See page 35 of the Personalize Developer's Guide for an example.
Hope this helps!
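For what it's worth, a minimal sketch of an Interactions schema with the required field names, registered via boto3 (the schema name is just an example; note also that the original schema declares CustomerID twice, which Avro itself rejects):

import json
import boto3

# Required Personalize Interactions fields; TIMESTAMP is Unix epoch time.
# EVENT_TYPE and EVENT_VALUE are the two optional reserved fields.
interactions_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},
        {"name": "EVENT_TYPE", "type": "string"},
        {"name": "EVENT_VALUE", "type": "float"},
    ],
    "version": "1.0",
}

personalize = boto3.client("personalize")
personalize.create_schema(
    name="interactions-schema",  # example name
    schema=json.dumps(interactions_schema),
)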

Using evolving avro schema for impala/hive storage

We have a JSON structure that we need to parse and use in Impala/Hive.
Since the JSON structure is evolving, we thought we could use Avro.
We plan to parse the JSON and format it as Avro.
The Avro-formatted data can be used directly by Impala. Let's say we store it in the HDFS directory /user/hdfs/person_data/.
We will keep putting Avro-serialized data in that folder as we parse the input JSON records one by one.
Let's say we have an Avro schema file for person (hdfs://user/hdfs/avro/scheams/person.avsc) like
{
"type": "record",
"namespace": "avro",
"name": "PersonInfo",
"fields": [
{ "name": "first", "type": "string" },
{ "name": "last", "type": "string" },
{ "name": "age", "type": "int" }
]
}
For this we will create a table in Hive as an external table -
CREATE TABLE kst
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
'avro.schema.url'='hdfs://user/hdfs/avro/scheams/person.avsc');
Let's say tomorrow we need to change this schema (hdfs://user/hdfs/avro/scheams/person.avsc) to -
{
"type": "record",
"namespace": "avro",
"name": "PersonInfo",
"fields": [
{ "name": "first", "type": "string" },
{ "name": "last", "type": "string" },
{ "name": "age", "type": "int" },
{ "name": "city", "type": "string" }
]
}
Can we keep putting the newly serialized data in the same HDFS directory /user/hdfs/person_data/, and will Impala/Hive still work, returning the city column as NULL for old records?
Yes, you can, but for all new columns you should specify a default value:
{ "name": "newField", "type": "int", "default":999 }
or mark them as nullable:
{ "name": "newField", "type": ["null", "int"] }

Query amazon ec2 attributes from inside instance for checking pricing

The Amazon EC2 pricing API provides different attributes for each type of pricing. How am I able to know which pricing my EC2 instance is running under? In the pricing API (i.e. the JSON file), Amazon provides a few attributes, and of these I am only able to fetch instanceType from inside an instance. How do I get the others?
[
{
"TDVRYW6K68T4XJHJ.JRTCKXETXF": {
"effectiveDate": "2016-01-01T00:00:00Z",
"offerTermCode": "JRTCKXETXF",
"priceDimensions": {
"TDVRYW6K68T4XJHJ.JRTCKXETXF.6YS6EN2CT7": {
"appliesTo": [],
"beginRange": "0",
"description": "$4.900 per On Demand Linux hs1.8xlarge Instance Hour",
"endRange": "Inf",
"pricePerUnit": {
"USD": "4.9000000000"
},
"rateCode": "TDVRYW6K68T4XJHJ.JRTCKXETXF.6YS6EN2CT7",
"unit": "Hrs"
}
},
"sku": "TDVRYW6K68T4XJHJ",
"termAttributes": {}
},
"attributes": {
"clockSpeed": "2 GHz",
"currentGeneration": "No",
"instanceFamily": "Storage optimized",
"instanceType": "hs1.8xlarge",
"licenseModel": "No License required",
"location": "EU (Ireland)",
"locationType": "AWS Region",
"memory": "117 GiB",
"networkPerformance": "10 Gigabit",
"operatingSystem": "Linux",
"operation": "RunInstances",
"physicalProcessor": "Intel Xeon E5-2650",
"preInstalledSw": "NA",
"processorArchitecture": "64-bit",
"servicecode": "AmazonEC2",
"storage": "24 x 2000",
"tenancy": "Shared",
"usagetype": "EU-BoxUsage:hs1.8xlarge",
"vcpu": "17"
}
}
]
1) find your instance size and AZ. For example
[ec2-user@ip-10-50-1-171 temp]$ ec2-metadata |grep placement
placement: eu-west-1a
[ec2-user@ip-10-50-1-171 temp]$ ec2-metadata |grep instance-type
instance-type: t2.micro
2) pull the correct pricing file for EC2; for example, at the moment it is
https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/index.json
3) within this file there are "products". So for instance within the products find a t2.micro for eu-west
"SYEPG42MVWFMUBT6" : {
"sku" : "SYEPG42MVWFMUBT6",
"productFamily" : "Compute Instance",
"attributes" : {
"servicecode" : "AmazonEC2",
"location" : "EU (Ireland)",
"locationType" : "AWS Region",
"instanceType" : "t2.micro",
"instanceFamily" : "General purpose",
"vcpu" : "1",
"physicalProcessor" : "Intel Xeon Family",
"clockSpeed" : "Up to 3.3 GHz",
"memory" : "1 GiB",
"storage" : "EBS only",
"networkPerformance" : "Low to Moderate",
"processorArchitecture" : "32-bit or 64-bit",
"tenancy" : "Shared",
"operatingSystem" : "SUSE",
"licenseModel" : "No License required",
"usagetype" : "EU-BoxUsage:t2.micro",
"operation" : "RunInstances:000g",
"preInstalledSw" : "NA",
"processorFeatures" : "Intel AVX; Intel Turbo"
}
},
Note the SKU for this product
4) next find the "terms" section in the json file. There are sections for "OnDemand" and "Reserved". In "OnDemand" the SKU for the product of interest (in the example above SYEPG42MVWFMUBT6) is mentioned once. In the "Reserved" there are several entries with different terms
If you need to do all these steps programmatically, you'd have to use either a shell script with a tool like jq, or a library for JSON processing like the one included with Python.
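As a rough Python sketch of those steps (no third-party libraries; it assumes IMDSv1 is reachable, and that a region-specific offer file exists alongside the global one, which keeps the download smaller):

import json
import urllib.request

IMDS = "http://169.254.169.254/latest/meta-data"
# Step 1: instance type and AZ from instance metadata (IMDSv2 would need a token).
instance_type = urllib.request.urlopen(f"{IMDS}/instance-type", timeout=2).read().decode()
az = urllib.request.urlopen(f"{IMDS}/placement/availability-zone", timeout=2).read().decode()
region = az[:-1]  # e.g. "eu-west-1a" -> "eu-west-1"

# Step 2: pull the offer file (the region-specific file is much smaller
# than the global index.json).
offer_url = ("https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/"
             f"AmazonEC2/current/{region}/index.json")
offer = json.load(urllib.request.urlopen(offer_url))

# Step 3: find candidate SKUs in "products" by attribute matching.
skus = [
    sku for sku, product in offer["products"].items()
    if product.get("attributes", {}).get("instanceType") == instance_type
    and product["attributes"].get("operatingSystem") == "Linux"
    and product["attributes"].get("tenancy") == "Shared"
    and product["attributes"].get("preInstalledSw") == "NA"
]

# Step 4: look the SKUs up under "terms" -> "OnDemand" for the hourly rate.
for sku in skus:
    for term in offer["terms"]["OnDemand"].get(sku, {}).values():
        for dim in term["priceDimensions"].values():
            print(sku, dim["description"], dim["pricePerUnit"]["USD"])
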
I know this question is old, and I don't have any definitive proof (someone please correct me if it's wrong), but through experimenting I've found that to get your specific running instance's cost, you really cannot use a combination of the EC2 metadata and the https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/price-changes.html API.
This is because the information returned from the price API is for current offers available in the near term for instances to be purchased, not for your specific running instance.
Specifically, what you would want to find is the RateCode associated with your specific instance, and this is not found through calling DescribeInstances. You can apply a series of filters to get a pretty good guess as to the likely cost of a specific instance running in your account. However, I am unable to uniquely identify a specific instance even with some specific filters like the following:
{
Filters: [
{
Type: 'TERM_MATCH',
Field: 'ServiceCode',
Value: 'AmazonEC2',
},
{
Type: 'TERM_MATCH',
Field: 'regionCode',
Value: 'us-east-1',
},
{
Type: 'TERM_MATCH',
Field: 'instanceType',
Value: 't3.medium',
},
{
Type: 'TERM_MATCH',
Field: 'marketoption',
Value: 'OnDemand',
},
{
Type: 'TERM_MATCH',
Field: 'operatingSystem',
Value: 'Linux',
},
{
Type: 'TERM_MATCH',
Field: 'tenancy',
Value: 'Shared',
},
],
FormatVersion: 'aws_v1',
NextToken: null,
ServiceCode: 'AmazonEC2',
};
(The above filter returns ~50 different price offerings when used to query the GetProducts API.)
If it's any help, here is some code I've been using to muck around.
I haven't tested it, but it might just be easier to crawl the pricing page.
EDIT:
I worked with AWS support a bit and found a workable set of filters. Copying the response from AWS here:
I understand that you want to use the AWS Price List Query API to output the pricing of an On-Demand t3.medium instance, however it is returning results for many instances of the same instance type instead of output for the exact EC2 instance you queried for.
I was able to reproduce the same behavior from my end when I used the same filters you provided over an AWS CLI get-products request. After digging into this a bit, I was able to substantially bring down the search results using the below filters with the help of a guide.
Command:
aws pricing get-products --filters file://filters2.json --format-version aws_v1 --service-code AmazonEC2
Filter json file:
[
{
"Type": "TERM_MATCH",
"Field": "ServiceCode",
"Value": "AmazonEC2"
},
{
"Type": "TERM_MATCH",
"Field": "regionCode",
"Value": "us-east-1"
},
{
"Type": "TERM_MATCH",
"Field": "instanceType",
"Value": "t3.medium"
},
{
"Type": "TERM_MATCH",
"Field": "marketoption",
"Value": "OnDemand"
},
{
"Type": "TERM_MATCH",
"Field": "operatingSystem",
"Value": "Linux"
},
{
"Type": "TERM_MATCH",
"Field": "tenancy",
"Value": "Shared"
},
{
"Type": "TERM_MATCH",
"Field": "preInstalledSw",
"Value": "NA"
},
{
"Type": "TERM_MATCH",
"Field": "licenseModel",
"Value": "No License required"
},
{
"Type": "TERM_MATCH",
"Field": "capacitystatus",
"Value": "Used"
}
]
The following were the results in a nutshell:
On Demand Instance price for the instance matched in filter.
Reserved Instances:
2.1: Term: 1 year or 3 years
2.2: Upfront: none, partial or all
2.3: Convertible: standard or convertible
I understand that you have applied a filter that matches the key value 'marketoption:OnDemand', but in the output there is a separate category called "terms" within the "serviceCode" key. The terms key is then a nested array containing different keys for 'OnDemand' and 'Reserved' etc., which would not normally be filtered by the above filters.
As for a workaround to filter out just the pricing of the On-Demand instance, I am afraid you would have to create a custom JSON query to filter out only the values of the On-Demand instance returned from the output of the updated script.
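A short boto3 version of the same thing, using those filters and then pulling the On-Demand rate out of the nested "terms" structure (each PriceList entry comes back as a JSON string; the Pricing API itself is served from us-east-1):

import json
import boto3

# The Pricing API is only available in a couple of regions, us-east-1 among them.
pricing = boto3.client("pricing", region_name="us-east-1")

filters = [
    {"Type": "TERM_MATCH", "Field": field, "Value": value}
    for field, value in [
        ("ServiceCode", "AmazonEC2"),
        ("regionCode", "us-east-1"),
        ("instanceType", "t3.medium"),
        ("marketoption", "OnDemand"),
        ("operatingSystem", "Linux"),
        ("tenancy", "Shared"),
        ("preInstalledSw", "NA"),
        ("licenseModel", "No License required"),
        ("capacitystatus", "Used"),
    ]
]

response = pricing.get_products(ServiceCode="AmazonEC2",
                                Filters=filters,
                                FormatVersion="aws_v1")

for price_item in response["PriceList"]:
    product = json.loads(price_item)  # each entry is a JSON string
    # "terms" still contains OnDemand and Reserved; pick out OnDemand only.
    for term in product["terms"]["OnDemand"].values():
        for dim in term["priceDimensions"].values():
            print(product["product"]["attributes"]["instanceType"],
                  dim["description"], dim["pricePerUnit"]["USD"])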