How can I work around the following error in Amazon Athena?
HIVE_INVALID_METADATA: com.facebook.presto.hive.DataCatalogException: Error: : expected at the position 8 of 'struct<x-amz-request-id:string,action:string,label:string,category:string,when:string>' but '-' is found. (Service: null; Status Code: 0; Error Code: null; Request ID: null)
Looking at position 8 in the AWS Glue-generated table that Athena queries, I can see that it has a column named attributes with the following struct data type:
struct <
x-amz-request-id:string,
action:string,
label:string,
category:string,
when:string
>
My guess is that the error occurs because the attributes field is not always populated (cf. the _session.start event below) and does not always contain all fields (e.g. the DocumentHandling event below does not contain the attributes.x-amz-request-id field). What is the appropriate way to address this problem? Can I make a column optional in Glue? Can (should?) Glue fill the struct with empty strings? Other options?
Background: I have the following backend structure:
Amazon Pinpoint Analytics collects metrics from my application.
The Pinpoint event stream has been configured to forward the events to an Amazon Kinesis Firehose delivery stream.
Kinesis Firehose writes the data to S3.
AWS Glue crawls the S3 data and generates a database and tables.
Athena is used to write queries against the databases and tables generated by AWS Glue.
I can see Pinpoint events successfully being added to JSON files in S3, e.g.
First event in a file:
{
"event_type": "_session.start",
"event_timestamp": 1524835188519,
"arrival_timestamp": 1524835192884,
"event_version": "3.1",
"application": {
"app_id": "[an app id]",
"cognito_identity_pool_id": "[a pool id]",
"sdk": {
"name": "Mozilla",
"version": "5.0"
}
},
"client": {
"client_id": "[a client id]",
"cognito_id": "[a cognito id]"
},
"device": {
"locale": {
"code": "en_GB",
"country": "GB",
"language": "en"
},
"make": "generic web browser",
"model": "Unknown",
"platform": {
"name": "macos",
"version": "10.12.6"
}
},
"session": {
"session_id": "[a session id]",
"start_timestamp": 1524835188519
},
"attributes": {},
"client_context": {
"custom": {
"legacy_identifier": "50ebf77917c74f9590c0c0abbe5522d2"
}
},
"awsAccountId": "672057540201"
}
Second event in the same file:
{
"event_type": "DocumentHandling",
"event_timestamp": 1524835194932,
"arrival_timestamp": 1524835200692,
"event_version": "3.1",
"application": {
"app_id": "[an app id]",
"cognito_identity_pool_id": "[a pool id]",
"sdk": {
"name": "Mozilla",
"version": "5.0"
}
},
"client": {
"client_id": "[a client id]",
"cognito_id": "[a cognito id]"
},
"device": {
"locale": {
"code": "en_GB",
"country": "GB",
"language": "en"
},
"make": "generic web browser",
"model": "Unknown",
"platform": {
"name": "macos",
"version": "10.12.6"
}
},
"session": {},
"attributes": {
"action": "Button-click",
"label": "FavoriteStar",
"category": "Navigation"
},
"metrics": {
"details": 40.0
},
"client_context": {
"custom": {
"legacy_identifier": "50ebf77917c74f9590c0c0abbe5522d2"
}
},
"awsAccountId": "[aws account id]"
}
Next, AWS Glue has generated a database and a table. Specifically, I see that there is a column named attributes with the following struct type:
struct <
x-amz-request-id:string,
action:string,
label:string,
category:string,
when:string
>
However, when I attempt to Preview table from Athena, i.e. execute the query
SELECT * FROM "pinpoint-test"."pinpoint_testfirehose" limit 10;
I get the error message described earlier.
Side note: I have tried to remove the attributes field (by editing the table definition in Glue), but that results in an "Internal error" when executing the SQL query from Athena.
This is a known limitation: Athena table and database names cannot contain special characters, other than underscore (_).
Source: http://docs.aws.amazon.com/athena/latest/ug/known-limitations.html
Use backticks (`) when the table name contains a hyphen (-).
Example:
SELECT * FROM `pinpoint-test`.`pinpoint_testfirehose` limit 10;
Make sure you select "default" database on the left pane.
I believe the problem is your struct element name x-amz-request-id, specifically the "-" in the name.
I'm currently dealing with a similar issue, since the elements in my struct have "::" in their names.
Sample data:
some_key: {
"system::date": date,
"system::nps_rating": 0
}
Glue-derived struct schema (it tried to escape them with a backslash):
struct <
system\:\:date:String
system\:\:nps_rating:Int
>
But that still gives me an error in Athena.
I don't have a good solution for this other than changing Struct to STRING and trying to process the data that way.
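For illustration, a rough sketch of what that string-based approach can look like in Athena (this assumes attributes has been redefined as a plain string column in the Glue table; the bracket-style JSON path is, as far as I know, how Presto addresses keys that contain hyphens):

SELECT
  event_type,
  json_extract_scalar(attributes, '$.action') AS action,
  json_extract_scalar(attributes, '$.label') AS label,
  json_extract_scalar(attributes, '$["x-amz-request-id"]') AS request_id
FROM `pinpoint-test`.`pinpoint_testfirehose`
LIMIT 10;

json_extract_scalar simply returns NULL when a key is missing, so sparsely populated attributes maps are not a problem.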
Related
I've got Logstash processing logs and uploading them to an OpenSearch instance running on AWS as a managed service.
I've added a geoip filter to my Logstash config to turn IPs into geographic data. According to the docs, the geoip filter should generate a location field containing lon and lat, and that field should be recognised as the geo_point type, which can then be used to populate map visualisations.
I've been trying for a couple of hours now, but OpenSearch always splits the location field into the numbers location.lon and location.lat instead of recognising location as geo_point, so I cannot use it for map visualisations.
Here is my logstash config:
input {
file {
...
codec => json {
target => "[log_message]"
}
}
}
filter {
...
geoip {
source => "[log_message][forwarded_ip_address]"
}
}
output {
...
opensearch {
...
ecs_compatibility => disabled
}
}
The template on my opensearch instance is the standard one, so it does contain this:
"geoip": {
"dynamic": true,
"properties": {
"ip": {
"type": "ip"
},
"latitude": {
"type": "half_float"
},
"location": {
"type": "geo_point"
},
"longitude": {
"type": "half_float"
}
}
},
I am not sure if this is relevant, but AWS OpenSearch requires ECS compatibility to be set to disabled, which I did.
Has somebody managed to do this successfully on AWS OpenSearch?
Have you tried setting the location field to the geo_point type in the index mapping before ingesting the data? I don't think OpenSearch detects the geo_point type automatically.
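If it helps, a minimal sketch of that idea (the template name and index pattern below are placeholders, and this assumes the geoip filter writes under geoip, which is the default when ECS compatibility is disabled; adjust the field path if your documents put location at the top level). You can create an index template with an explicit geo_point mapping before Logstash creates the index, e.g. via the OpenSearch REST API:

PUT _index_template/geoip-location
{
  "index_patterns": ["logstash-*"],
  "template": {
    "mappings": {
      "properties": {
        "geoip": {
          "properties": {
            "location": { "type": "geo_point" }
          }
        }
      }
    }
  }
}

Any index created after this (and matching the pattern) should map geoip.location as geo_point instead of splitting it into two float fields; existing indices would need to be reindexed, since a field's type cannot be changed in place.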
My goal is to have an AWS Systems Manager document download a script from S3 and then run that script on the selected EC2 instance. In this case, it will be a Linux OS.
According to AWS documentation for aws:downloadContent the sourceInfo Input is of type StringMap.
The example code looks like this:
{
"schemaVersion": "2.2",
"description": "aws:downloadContent",
"parameters": {
"sourceType": {
"description": "(Required) The download source.",
"type": "String"
},
"sourceInfo": {
"description": "(Required) The information required to retrieve the content from the required source.",
"type": "StringMap"
}
},
"mainSteps": [
{
"action": "aws:downloadContent",
"name": "downloadContent",
"inputs": {
"sourceType":"{{ sourceType }}",
"sourceInfo":"{{ sourceInfo }}"
}
}
]
}
This code assumes you will run this document by hand (console or CLI) and then enter the sourceInfo in the parameter. When running this document by hand, anything entered in the parameter (an S3 URL) isn't accepted. However, I'm not trying to run this by hand, but rather programmatically, and I want to hard-code the S3 URL into sourceInfo in mainSteps.
AWS does give an example of syntax that looks like this:
{
"path": "https://s3.amazonaws.com/aws-executecommand-test/powershell/helloPowershell.ps1"
}
I've coded the document action in mainSteps like this:
{
"action": "aws:downloadContent",
"name": "downloadContent",
"inputs": {
"sourceType": "S3",
"sourceInfo":
{
"path": "https://s3.amazonaws.com/bucketname/folder1/folder2/script.sh"
},
"destinationPath": "/tmp"
}
},
However, it doesn't seem to work and I receive this error:
invalid format in plugin properties map[sourceInfo:map[path:https://s3.amazonaws.com/bucketname/folder1/folder2/script.sh] sourceType:S3];
error json: cannot unmarshal object into Go struct field DownloadContentPlugin.sourceInfo of type string
Note: I have seen this post that describes how to format it for Windows. I did try it, but it didn't work and doesn't seem relevant to my Linux needs.
So my questions are:
Do you need a parameter for sourceInfo of type StringMap - something that won't be used within the aws:downloadContent {{ sourceInfo }} mainSteps?
How do you properly format the aws:downloadContent action sourceInfo StringMap in mainSteps?
Thank you for your effort in advance.
I had a similar issue, as I did not want anyone to have to type the value when running the document, so I added a default to the sourceInfo parameter:
"sourceInfo": {
"description": "(Required) Blah.",
"type": "StringMap",
"displayType": "textarea",
"default": {
"path": "https://mybucket-public.s3-us-west-2.amazonaws.com/automation.sh"
}
}
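For completeness, here is a rough sketch of how the whole document could look with that default in place (the bucket path and script name are placeholders, and the final aws:runShellScript step that executes the downloaded file is my addition rather than something from the question):

{
  "schemaVersion": "2.2",
  "description": "Download a script from S3 and run it",
  "parameters": {
    "sourceType": {
      "description": "(Required) The download source.",
      "type": "String",
      "default": "S3"
    },
    "sourceInfo": {
      "description": "(Required) The information required to retrieve the content.",
      "type": "StringMap",
      "displayType": "textarea",
      "default": {
        "path": "https://s3.amazonaws.com/mybucket-public/automation.sh"
      }
    }
  },
  "mainSteps": [
    {
      "action": "aws:downloadContent",
      "name": "downloadContent",
      "inputs": {
        "sourceType": "{{ sourceType }}",
        "sourceInfo": "{{ sourceInfo }}",
        "destinationPath": "/tmp"
      }
    },
    {
      "action": "aws:runShellScript",
      "name": "runDownloadedScript",
      "inputs": {
        "runCommand": [
          "chmod +x /tmp/automation.sh",
          "/tmp/automation.sh"
        ]
      }
    }
  ]
}

This keeps the {{ sourceInfo }} reference in mainSteps and hard-codes the S3 URL through the parameter default instead, so nothing has to be typed at run time.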
I'm trying to get AWS IoT rules to trigger my actions, but the documentation is pretty poor. For some reason the documentation assumes I'll have JSON payloads with only one level of nesting.
Example of my JSON payload:
"state": {
"reported": {
"movement": "yes"
}
}
}
The query I'm using inside the rule:
SELECT * FROM '$aws/things/thing-name/shadow/update/accepted' WHERE state.reported.movement="yes"
The documentation I'm using (http://docs.aws.amazon.com/iot/latest/developerguide/iot-sql-where.html) only covers flat JSON objects. I did try state.reported.movement, reported.movement, and just movement, and it looks like none of them works.
Okay, the answer to my question is: AWS IoT supports nested properties in WHERE statements. For some reason it didn't work when I first created the rule; maybe the SNS topic had some delay delivering messages.
The state of my thing, as it arrives on the AWS IoT accepted topic, looks like this:
{
"state": {
"reported": {
"system": "armed",
"movement": "yes"
}
},
"metadata": {
"reported": {
"system": {
"timestamp": 1509207282
},
"movement": {
"timestamp": 1509207282
}
}
},
"version": 618,
"timestamp": 1509207282,
"clientToken": "xxxxxxxx"
}
And to query that JSON I'm using the following query:
SELECT * FROM '$aws/things/myThing/shadow/update/accepted' WHERE state.reported.movement="yes" and state.reported.system="armed"
I am trying to update DynamoDB by sending JSON data from a Raspberry Pi or an MQTT client, but when I look at CloudWatch I see the error message below.
EVENT:DynamoActionFailure TOPICNAME:iotbutton/test CLIENTID:MQTT_FX_Client MESSAGE:Dynamo Insert record failed. The error received was Attribute name must not be null or empty. Message arrived on: iotbutton/test, Action: dynamo, Table: myTable_IoT, HashKeyField: SerialNumber, HashKeyValue: ABCDEFG12345, RangeKeyField: Some(ClickType), RangeKeyValue: SINGLE
I am following the AWS IoT tutorial (http://docs.aws.amazon.com/iot/latest/developerguide/iot-dg.pdf), specifically the section "Creating a DynamoDB Rule".
The data I send to the IoT platform is:
{
"serialNumber" : "ABCDEFG12345",
"clickType" : "SINGLE",
"batteryVoltage" : "5v USB"
}
topic: iotbutton/ABCDEFG12345
Has anyone come across this error and is aware of any solution?
Thanks, regards.
This is the message the CloudWatch logs showed when I tried doing this:
{
"timestamp": "2019-01-28 21:26:16.363",
"logLevel": "ERROR",
"traceId": "9e3ff9b0-fcdf-d8ae-e8a8-4b7a24902405",
"accountId": "xxx",
"status": "Failure",
"eventType": "RuleExecution",
"clientId": "basicPubSub",
"topicName": "xxx/r117",
"ruleName": "devCompDynamoDB",
"ruleAction": "DynamoAction",
"resources":
{
"ItemRangeKeyValue": "SINGLE",
"IsPayloadJSON": "true",
"ItemHashKeyField": "SerialNumber",
"Operation": "Insert",
"ItemRangeKeyField": "ClickType",
"Table": "TestIoTDataTable",
"ItemHashKeyValue": "ABCDEFG12345"
},
"principalId": "xx",
"details": "Attribute name must not be null or empty"
}
To fix it, I edited the DynamoDB rule in the IoT web console and added a payload column name in the "Write message data to this column" field.
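For reference, a sketch of roughly the same fix expressed as a rule definition (e.g. for aws iot create-topic-rule), where that console field corresponds to the payloadField property of the dynamoDB action; the topic filter, table name, role ARN and substitution templates below are placeholders:

{
  "sql": "SELECT * FROM 'iotbutton/+'",
  "actions": [
    {
      "dynamoDB": {
        "tableName": "myTable_IoT",
        "roleArn": "arn:aws:iam::123456789012:role/my-iot-dynamodb-role",
        "hashKeyField": "SerialNumber",
        "hashKeyValue": "${serialNumber}",
        "rangeKeyField": "ClickType",
        "rangeKeyValue": "${clickType}",
        "payloadField": "payload",
        "operation": "INSERT"
      }
    }
  ]
}

With payloadField set, the rule has a non-empty column name to write the message body into, which appears to be what the "Attribute name must not be null or empty" error was complaining about.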
I successfully managed to get a data pipeline to transfer data from a set of tables in Amazon RDS (Aurora) to a set of .csv files in S3 with a "copyActivity" connecting the two DataNodes.
However, I'd like the .csv file to have the name of the table (or view) that it came from. I can't quite figure out how to do this. I think the best approach is to use an expression in the filePath parameter of the S3 DataNode.
But, I've tried #{table}, #{node.table}, #{parent.table}, and a variety of combinations of node.id and parent.name without success.
Here's a couple of JSON snippets from my pipeline:
"database": {
"ref": "DatabaseId_abc123"
},
"name": "Foo",
"id": "DataNodeId_xyz321",
"type": "MySqlDataNode",
"table": "table_foo",
"selectQuery": "select * from #{table}"
},
{
"schedule": {
"ref": "DefaultSchedule"
},
"filePath": "#{myOutputS3Loc}/#{parent.node.table.help.me.here}.csv",
"name": "S3_BAR_Bucket",
"id": "DataNodeId_w7x8y9",
"type": "S3DataNode"
}
Any advice you can provide would be appreciated.
I see that you have #{table} (did you mean #{myTable}?). If you are using a parameter to pass the name of the DB table, you can use that in the S3 filepath as well like this:
"filePath": "#{myOutputS3Loc}/#{myTable}.csv",