AWS Athena struct not parsing JSON string - amazon-athena

I am using AWS Athena to run some queries on AWS CloudTrail object-level (data event) log entries.
The first few fields in a typical log entry look like this (pretty-printed for clarity):
{
  "Records": [
    {
      "eventVersion": "1.08",
      "userIdentity": {
        "type": "AWSAccount",
        "principalId": "",
        "accountId": "ANONYMOUS_PRINCIPAL"
      },
      "eventTime": "2021-03-23T14:04:38Z",
      "eventSource": "s3.amazonaws.com",
      "eventName": "GetObject",
      "awsRegion": "us-east-1",
      "sourceIPAddress": "12.34.45.56",
      "userAgent": "[Amazon CloudFront]",
      "requestParameters": {
        "bucketName": "mybucket",
        "Host": "mybucket.s3.amazonaws.com",
        "key": "bin/some/path/to/a/file"
      },
      "responseElements": null,
      ...
The AWS CloudTrail console will create a standard table to query these entries. The table is defined like this:
CREATE EXTERNAL TABLE `cloudtrail_logs_mybucket_logs`(
`eventversion` string COMMENT 'from deserializer',
`useridentity` struct<type:string,principalid:string,arn:string,accountid:string,invokedby:string,accesskeyid:string,username:string,sessioncontext:struct<attributes:struct<mfaauthenticated:string,creationdate:string>,sessionissuer:struct<type:string,principalid:string,arn:string,accountid:string,username:string>>> COMMENT 'from deserializer',
`eventtime` string COMMENT 'from deserializer',
`eventsource` string COMMENT 'from deserializer',
`eventname` string COMMENT 'from deserializer',
`awsregion` string COMMENT 'from deserializer',
`sourceipaddress` string COMMENT 'from deserializer',
`useragent` string COMMENT 'from deserializer',
`errorcode` string COMMENT 'from deserializer',
`errormessage` string COMMENT 'from deserializer',
`requestparameters` string COMMENT 'from deserializer',
`responseelements` string COMMENT 'from deserializer',
`additionaleventdata` string COMMENT 'from deserializer',
`requestid` string COMMENT 'from deserializer',
`eventid` string COMMENT 'from deserializer',
`resources` array<struct<arn:string,accountid:string,type:string>> COMMENT 'from deserializer',
`eventtype` string COMMENT 'from deserializer',
`apiversion` string COMMENT 'from deserializer',
`readonly` string COMMENT 'from deserializer',
`recipientaccountid` string COMMENT 'from deserializer',
`serviceeventdetails` string COMMENT 'from deserializer',
`sharedeventid` string COMMENT 'from deserializer',
`vpcendpointid` string COMMENT 'from deserializer')
COMMENT 'CloudTrail table for adafruit-circuit-python-logs bucket'
ROW FORMAT SERDE
'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT
'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://mybucket/AWSLogs/12345678901234/CloudTrail'
TBLPROPERTIES (
'classification'='cloudtrail',
'transient_lastDdlTime'='1616514617')
Note that useridentity is described as a struct, but requestParameters is a string. I would like to use the struct feature to pre-parse requestParameters, so I tried this:
CREATE EXTERNAL TABLE `cloudtrail_logs_mybucket_logs2`(
`eventversion` string COMMENT 'from deserializer',
`useridentity` struct<type:string,principalid:string,arn:string,accountid:string,invokedby:string,accesskeyid:string,username:string,sessioncontext:struct<attributes:struct<mfaauthenticated:string,creationdate:string>,sessionissuer:struct<type:string,principalid:string,arn:string,accountid:string,username:string>>> COMMENT 'from deserializer',
`eventtime` string COMMENT 'from deserializer',
`eventsource` string COMMENT 'from deserializer',
`eventname` string COMMENT 'from deserializer',
`awsregion` string COMMENT 'from deserializer',
`sourceipaddress` string COMMENT 'from deserializer',
`useragent` string COMMENT 'from deserializer',
`errorcode` string COMMENT 'from deserializer',
`errormessage` string COMMENT 'from deserializer',
`requestparameters` struct<`bucketName`:string, `Host`:string, `key`:string> COMMENT 'THIS IS NEW',
...[rest same as above]
The table is created, but trying to do a simple query using it ("Preview Table") gives this error:
GENERIC_INTERNAL_ERROR: parent builder is null
What's wrong with my attempt to use a struct for requestparameters? In terms of the JSON, it seems no different from what is going on with useridentity.

You should use the JSON Serializer/Deserializer (SerDe) instead:
...
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES ( 'paths'='<LIST OF COLUMNS>' )
...
See the docs: https://docs.aws.amazon.com/athena/latest/ug/json-serde.html#hive-json-serde

Related

MongoDB regex query with property that can be null

Writing a MongoDB regex query in Spring with a field that can be null.
I want to query documents by name and phone:
@Query(value = "{ {'name' : {$regex:?0,$options:'i'}},
                  {'phone' : {$regex:?1,$options:'i'}} }")
Document findByFullNameOrPhone(String fullName, String phone);
The value I'm passing through the query for phone is ".*" in an attempt to match everything.
It works but the problem is phone is a field that can be null. If the document has no phone value it's not included in the query result. Is it possible to use this query to find all documents in the database, even if the document does not have a value for phone?
Just add a null check:
@Query(value = "{" +
        " 'name': {'$regex': ?0, '$options': 'i'}," +
        " $or: [" +
        "   {'phone': null}," +
        "   {'phone': {'$regex': ?1, '$options': 'i'}}" +
        " ]" +
        "}")
Document findByFullNameOrPhone(String fullName, String phone);
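For illustration only, the same filter expressed with pymongo (the Spring annotation above is the actual fix; the connection details here are hypothetical):

from pymongo import MongoClient

collection = MongoClient()["mydb"]["people"]  # hypothetical database/collection

def find_by_full_name_or_phone(name_pattern, phone_pattern):
    # {'phone': None} also matches documents where the field is missing entirely,
    # so records without a phone are no longer dropped from the result.
    return collection.find({
        "name": {"$regex": name_pattern, "$options": "i"},
        "$or": [
            {"phone": None},
            {"phone": {"$regex": phone_pattern, "$options": "i"}},
        ],
    })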

Glue AWS creating a data catalog table on boto3 python

I have been trying to create a table in our data catalog using the Python API, following the documentation posted here and here. I understand the general flow. Nevertheless, I need to understand how to declare a struct field when I create the table, because the StorageDescriptor documentation here does not explain how I should define this type of column for my table. In addition, I don't see where the classification property for the table is covered. Maybe in the table properties? I have used the boto3 documentation for this sample code:
import boto3

client = boto3.client(service_name='glue', region_name='us-east-1')

response = client.create_table(
    DatabaseName='dbname',
    TableInput={
        'Name': 'tbname',
        'Description': 'tb description',
        'Owner': "I'm",
        'StorageDescriptor': {
            'Columns': [
                {'Name': 'agents', 'Type': 'struct', 'Comment': 'from deserializer'},
                {'Name': 'conference_sid', 'Type': 'string', 'Comment': 'from deserializer'},
                {'Name': 'call_sid', 'Type': 'string', 'Comment': 'from deserializer'}
            ],
            'Location': 's3://bucket/location/',
            'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
            'Compressed': False,
            'SerdeInfo': {'SerializationLibrary': 'org.openx.data.jsonserde.JsonSerDe'}
        },
        'TableType': 'EXTERNAL_TABLE'
    }
)
I found this post because I ran into the same issue and eventually found the solution: you can declare the column type as a Hive-style type string, for example:
array<struct<id:string,timestamp:bigint,message:string>>
I found this "hint" while using the AWS console and clicking on the data type of an existing table created via a crawler. It shows the following examples (a boto3 sketch using such a type string follows them):
An ARRAY of scalar type as a top-level column.
ARRAY <STRING>
An ARRAY with elements of complex type (STRUCT).
ARRAY < STRUCT <
place: STRING,
start_year: INT
>>
An ARRAY as a field (CHILDREN) within a STRUCT. (The STRUCT is inside another ARRAY, because it is rare for a STRUCT to be a top-level column.)
ARRAY < STRUCT <
spouse: STRING,
children: ARRAY <STRING>
>>
A STRUCT as the element type of an ARRAY.
ARRAY < STRUCT <
street: STRING,
city: STRING,
country: STRING
>>
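Putting that together, here is a minimal boto3 sketch of how such a type string could be declared; it just reuses the column names and S3 location from the question, so adjust the nested fields to your own schema:

import boto3

client = boto3.client(service_name='glue', region_name='us-east-1')

client.create_table(
    DatabaseName='dbname',
    TableInput={
        'Name': 'tbname',
        'StorageDescriptor': {
            'Columns': [
                # Nested columns are declared with a Hive-style type string.
                {'Name': 'agents', 'Type': 'array<struct<id:string,timestamp:bigint,message:string>>'},
                {'Name': 'conference_sid', 'Type': 'string'},
                {'Name': 'call_sid', 'Type': 'string'}
            ],
            'Location': 's3://bucket/location/',
            'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
            'SerdeInfo': {'SerializationLibrary': 'org.openx.data.jsonserde.JsonSerDe'}
        },
        # The classification the console/crawler normally sets can go into the table parameters.
        'Parameters': {'classification': 'json'},
        'TableType': 'EXTERNAL_TABLE'
    }
)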

Amazon CLI, route 53, TXT error

I'm trying to create a TXT record in Route53 via the Amazon CLI for DNS-01 validation. Seems like I'm very close but possibly running into a CLI issue (or a formatting issue I don't see). As you can see, it's complaining about a value that should be in quotes, but is indeed in quotes already...
Command Line:
aws route53 change-resource-record-sets --hosted-zone-id ID_HERE --change-batch file://c:\dev\test1.json
JSON File:
{
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "DOMAIN_NAME_HERE",
        "Type": "TXT",
        "TTL": 60,
        "ResourceRecords": [
          {
            "Value": "test"
          }
        ]
      }
    }
  ]
}
Error:
An error occurred (InvalidChangeBatch) when calling the ChangeResourceRecordSets operation: Invalid Resource Record: FATAL problem: InvalidCharacterString (Value should be enclosed in quotation marks) encountered with 'test'
Those quotes are the JSON quotes, and those are not the quotes they're looking for.
The JSON string "test" encodes the literal value test.
The JSON string "\"test\"" encodes the literal value "test".
(This is because in JSON, a literal " in a string is escaped with a leading \).
It sounds like they want actual, literal quotes included inside the value, so if you're building this JSON manually you probably want the latter: "Value": "\"test\"".
A JSON library should do this for you if you pass it the value with the leading and trailing " included.
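For example, here is a minimal sketch that builds the change batch with Python's json module (the domain and output file name are placeholders); json.dump escapes the inner quotes, so the file ends up containing "Value": "\"test\"":

import json

# The TXT record value must itself contain literal quotes, so wrap it before serializing.
txt_value = '"test"'

change_batch = {
    "Changes": [
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "DOMAIN_NAME_HERE",
                "Type": "TXT",
                "TTL": 60,
                "ResourceRecords": [{"Value": txt_value}],
            },
        }
    ]
}

# Produces a file you can pass to --change-batch file://test1.json
with open("test1.json", "w") as f:
    json.dump(change_batch, f, indent=2)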

How do you export a Map data type column on DynamoDB to S3 with JSON data type using HiveQL on EMR?

There are records with a map data type in DynamoDB, and I want to export these records to S3 in JSON format using HiveQL on EMR.
How do you do this? Is it possible?
I read the following documentation, but it did not contain the information I wanted.
DynamoDB DataFormat Documentation: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DataFormat.html
Hive Command Examples for Exporting... Documentation: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMR_Hive_Commands.html
I tried the following steps:
Create a table on DynamoDB
TableName: DynamoDBTable1
HashKey: user_id
Insert two records to DynamoDB
# record1
user_id: "0001"
json: {"key1": "value1", "key2": "value2"}
# record2
user_id: "0001"
json: {"key1": "value1", "key2": "value2"}
Create a table on EMR from DynamoDB
CREATE EXTERNAL TABLE test (user_id string, json map<string, string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "DynamoDBTable",
"dynamodb.column.mapping" = "user_id:user_id,json:json");
Export records to S3
INSERT OVERWRITE DIRECTORY 's3://some-bucket/exports/' select json from test where user_id = '0001';
Checking the S3 bucket, the exported data is not in JSON format...
# Expected
[
{"key1": "value1", "key2": "value2"},
{"key1": "value1", "key2": "value2"}
]
# Actual
key1^C{"s":"value1"}^Bkey2^C{"s":"value2"}
key1^C{"s":"value1"}^Bkey2^C{"s":"value2"}
The following DynamoDB data types are not supported by the DynamoDBStorageHandler class, so they cannot be used with dynamodb.column.mapping: Map, List, Boolean, and Null.

AWS CloudSearch cannot upload documents

I am new to AWS and CloudSearch. I have written a very simple app to upload a docx document (already converted to JSON format with cs-import-document) to my search domain.
The code is very straightforward:
using (var searchdomainclient = new AmazonCloudSearchDomainClient("http://search-xxxxx-xysjxyuxjxjxyxj.ap-southeast-2.cloudsearch.amazonaws.com"))
{
    // Test to upload doc
    var uploaddocrequest = new UploadDocumentsRequest()
    {
        FilePath = @"c:\temp\testsearch.sdf", // docx converted to JSON already
        ContentType = ContentType.ApplicationJson
    };
    var uploadresult = searchdomainclient.UploadDocuments(uploaddocrequest);
}
However the exception I got is: "Root element is missing."
Here is the JSON stuff in the sdf file I want to upload:
[{
  "type": "add",
  "id": "c:_temp_testsearch.docx",
  "fields": {
    "template": "Normal.dotm",
    "application_name": "Microsoft Office Word",
    "paragraph_count": "1",
    "resourcename": "testsearch.docx",
    "date": "2014-07-28T23:52:00Z",
    "xmptpg_npages": "1",
    "page_count": "1",
    "publisher": "",
    "creator": "John Smith",
    "creation_date": "2014-07-28T23:52:00Z",
    "content": "Test5",
    "author": "John Smith",
    "last_modified": "2014-07-29T04:22:00Z",
    "revision_number": "3",
    "line_count": "1",
    "application_version": "15.0000",
    "last_author": "John Smith",
    "character_count": "5",
    "character_count_with_spaces": "5",
    "content_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
  }
}]
So what's wrong with my approach?
Thanks heaps!
P.S. I can manually upload the docx document to that search domain and use C# code to run searches against it.
============= Update 2014-08-04 ===================
I am not sure whether this is related or not. In the stack trace I found that it tries to parse the response as an XML file rather than JSON. But in my code I already set ContentType = JSON; it seems to have no effect.
at System.Xml.XmlTextReaderImpl.ThrowWithoutLineInfo(String res)
at System.Xml.XmlTextReaderImpl.ParseDocumentContent()
at Amazon.Runtime.Internal.Transform.XmlUnmarshallerContext.Read()
at Amazon.Runtime.Internal.Transform.ErrorResponseUnmarshaller.Unmarshall(XmlUnmarshallerContext context)
at Amazon.Runtime.Internal.Transform.JsonErrorResponseUnmarshaller.Unmarshall(JsonUnmarshallerContext context)
at Amazon.CloudSearchDomain.Model.Internal.MarshallTransformations.UploadDocumentsResponseUnmarshaller.UnmarshallException(JsonUnmarshallerContext context, Exception innerException, HttpStatusCode statusCode)
at Amazon.Runtime.Internal.Transform.JsonResponseUnmarshaller.UnmarshallException(UnmarshallerContext input, Exception innerException, HttpStatusCode statusCode)
at Amazon.Runtime.AmazonWebServiceClient.HandleHttpWebErrorResponse(AsyncResult asyncResult, WebException we)
at Amazon.Runtime.AmazonWebServiceClient.getResponseCallback(IAsyncResult result)
at Amazon.Runtime.AmazonWebServiceClient.endOperation[T](IAsyncResult result)
at Amazon.CloudSearchDomain.AmazonCloudSearchDomainClient.EndUploadDocuments(IAsyncResult asyncResult)
at Amazon.CloudSearchDomain.AmazonCloudSearchDomainClient.UploadDocuments(UploadDocumentsRequest request)
at Amazon.CloudSearchDomain.Model.Internal.MarshallTransformations.UploadDocumentsResponseUnmarshaller.UnmarshallException(JsonUnmarshallerContext context, Exception innerException, HttpStatusCode statusCode)
Your document id contains invalid characters (period and colon). From https://aws.amazon.com/articles/8871401284621700 :
The ID must be unique across all of the documents you upload to the domain and can contain the following characters: a-z (lowercase letters), 0-9, and the underscore character (_). Document IDs must start with a letter or number and can be up to 64 characters long.
It is also unclear what endpoint you're posting to but you may also have a problem there.
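For illustration (in Python rather than the C# above), here is a small sketch of one way to derive a compliant document id from a file path, based purely on the rules quoted above:

import re

def to_document_id(path):
    # Keep only a-z, 0-9 and underscore, per the ID rules quoted above.
    doc_id = re.sub(r'[^a-z0-9_]', '_', path.lower())
    # IDs must start with a letter or number and be at most 64 characters long.
    doc_id = doc_id.lstrip('_') or 'doc'
    return doc_id[:64]

print(to_document_id(r'c:\temp\testsearch.docx'))  # -> c__temp_testsearch_docx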
I had exactly the same exception with SDK version 2.2.2.0. When I updated the SDK to version 2.2.2.1, the exception went away.