Why doesn't my Kinesis Analytics Application Schema Discovery work?

I am sending comma-separated data to my Kinesis stream, and I want my Kinesis Analytics application to recognize that there are two columns (both bigints). But when I populate my stream with some records and click "Discover Schema", it always gives me a schema with a single column! Here's a screenshot:
I have tried many different delimiters to indicate columns, including comma, space, and comma-space, but none of them cause AWS to detect my schema properly. At one point I gave up and edited the schema manually, which caused this error:
While I know that I have the option to keep the schema as a single column and use string and date-time manipulation to structure my data, I would prefer not to do it that way... Any suggestions?

While I wasn't able to get the schema discovery tool to work at first, I realized that I was able to edit my schema manually and it works fine. I was getting that error because I had only populated the stream once, initially, and was not continuously sending data.

Schema discovery required me to send data to my input Kinesis stream while the discovery was running. To do this for my proof-of-concept application I used the AWS CLI:
# emittokinesis.sh
# Build a sample JSON payload to send to the input stream.
JSON='{
  "messageId": "31c14ee7-9bde-484d-af05-03509c2c33aa",
  "myTest": "myValue"
}'
echo "$JSON"
# Base64-encode the payload for the --data argument.
JSONBASE64=$(echo ${JSON} | base64)
# Print the command for reference, then send the record.
echo 'aws kinesis put-record --stream-name logstash-input-test --partition-key 1 --data "'${JSONBASE64}'"'
aws kinesis put-record --stream-name logstash-input-test --partition-key 1 --data "${JSONBASE64}"
I clicked the "Run Schema Discovery" button in the AWS UI and then quickly ran my shell script in a CMD window.
Once my initial schema was discovered I could edit it manually, but it mostly matched what I expected based on my input JSON.
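For the comma-separated case from the question, the same idea would look roughly like this with boto3 (a sketch only; the region is an assumption and the stream name is reused from the script above). It keeps emitting two-column bigint records while "Discover Schema" samples the stream:
import random
import time
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # region is a placeholder

# Keep emitting comma-separated records (two bigint columns) so that
# schema discovery sees live data while it samples the stream.
while True:
    record = f"{random.randint(0, 10**9)},{random.randint(0, 10**9)}\n"
    kinesis.put_record(
        StreamName="logstash-input-test",  # stream name from the script above
        Data=record.encode("utf-8"),
        PartitionKey="1",
    )
    time.sleep(0.2)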

Related

Why do I get ThrottlingException - Rate Exceeded status:400 when making AWS Athena API call from API server?

We have an S3 data lake in AWS (with Lake Formation, Glue, etc.). The end goal is to query the S3 data sources using SQL in Athena.
When making the query in the AWS Athena console, everything works fine and results are provided; see screenshot: https://share.getcloudapp.com/NQuNBr5g
When making the query through the application's API (a Symfony 5 RESTful API that uses the aws-sdk-php package), the query doesn't even get to Athena; the error returned is 400: https://share.getcloudapp.com/xQuqQLrq
In the CloudTrail events I can only see errorcode=ThrottlingException and errormessage='Rate exceeded'; there is no query execution id.
The weird thing I don't get is that when I make the same call from my localhost setup of the API app, the call is again successful: https://share.getcloudapp.com/jkuv8ZGy
The call made is StartQueryExecution on the Athena API; the error as shown on the API app's side:
Error executing \"GetQueryExecution\" on \"https://athena.us-west-2.amazonaws.com\"; AWS HTTP error: Client error: `POST https://athena.us-west-2.amazonaws.com` resulted in a `400 Bad Request` response:\n{\"__type\":\"ThrottlingException\",\"message\":\"Rate exceeded\"}\n ThrottlingException (client): Rate exceeded - {\"__type\":\"ThrottlingException\",\"message\":\"Rate exceeded\"}", "class": "Aws\\Athena\\Exception\\AthenaException"
The API app server and the data lake etc. are in the same VPC, and I created a VPC endpoint from the server's VPC to the Athena us-west-2 endpoint, but it didn't help. I don't think it's an Athena quota issue, since on localhost the query works just fine. Any insight would be very helpful, thank you!
The solution was a combination of actions. Athena just doesn't work like that: it's not reasonable to expect data from an Athena query over an S3 data lake as if you were querying a relational database. What helped me get results consistently and avoid this error was:
Update the PHP SDK AthenaClient constructor call to also pass a retry configuration:
$athenaClient = new Aws\Athena\AthenaClient([
    // ... other AthenaClient constructor params ...
    'retries' => [
        'mode' => 'standard',
        'max_attempts' => 3,
    ],
]);
Athena and other elastic services (e.g. DynamoDB) work asynchronously. You issue the query, but the result is not delivered synchronously. For example, in my early tests I always received the initial ThrottlingException, yet in the Athena query console the result of that exact same query showed up slightly later, successfully. The AWS SDK for PHP appears to be built with this in mind, and retrying with exponential backoff is also what AWS recommends: https://docs.aws.amazon.com/general/latest/gr/api-retries.html (see the sketch after this list).
Partition your data, and in a relevant way, so that each query scans as little data as possible; this gives more consistent and faster results: https://docs.aws.amazon.com/athena/latest/ug/partitions.html. Partitioning can be set either on the Glue table directly or via a Glue ETL job where the partitioning keys are specified. If your Athena query filters on country = {country}, a good partitioning scheme is per country.
Avoid SELECT *: always name exactly the columns you need, and add a LIMIT. Queries over Athena should be relatively simple SELECT queries; if you need joins or other more complex query types, Redshift is better suited for that.
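As a rough sketch of that start-then-poll pattern (here with boto3 rather than the PHP SDK; the database, table, output location, and query are placeholders, and the retry config mirrors the one above):
import time
import boto3
from botocore.config import Config

# Same retry idea as the PHP config above, expressed for boto3.
athena = boto3.client(
    "athena",
    region_name="us-west-2",
    config=Config(retries={"mode": "standard", "max_attempts": 3}),
)

# StartQueryExecution returns immediately with an execution id, not with results.
query_id = athena.start_query_execution(
    QueryString="SELECT col_a, col_b FROM my_table WHERE country = 'NL' LIMIT 100",  # placeholder query
    QueryExecutionContext={"Database": "my_database"},                     # placeholder
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},     # placeholder
)["QueryExecutionId"]

# Poll for completion with exponential backoff instead of expecting a synchronous answer.
delay, state = 0.5, "QUEUED"
while state in ("QUEUED", "RUNNING"):
    time.sleep(delay)
    delay = min(delay * 2, 10)
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]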

Athena / Presto table on top of multiple JSON documents in one line

I am looking for ways to use Athena to read Lambda logs that get shipped to S3. We use centralized logging as explained here, with minor variations. The flow is: Lambda -> CloudWatch -> subscription filter -> Kinesis Stream -> Kinesis Firehose -> S3.
https://aws.amazon.com/solutions/implementations/centralized-logging/
The logs land in a multiple-JSON-per-line format: it is always one line containing multiple JSON documents concatenated together, with no delimiter between them.
{"messageType":"DATA_MESSAGE","owner":"xxxx","logGroup":"/aws/lambda/xxx","logStream":"2021/07/16/[4]...... }]}{"messageType":"DATA_MESSAGE","owner":"xxxx","logGroup":"/aws/lambda/xxx","logStream":"2021/07/16/[4]......}]}
I tried to use org.openx.data.jsonserde.JsonSerDe to read this JSON. But, as you would imagine, it only parses out the first JSON document; the second one concatenated to it is totally ignored. There are no errors from the DDL; it just silently ignores it.
https://github.com/yuki-mt/scripts/blob/a73076cbb3062dae2cf30ddbbe68f5293738264d/aws/athena/sample_queries.sql
Curious to see if anyone has encountered this requirement and solved it with a smart DDL? There's not a lot of room to re-format the logs or look at different solutions; I'm restricted to S3 and Athena.
Thanks for your time.

AWS Glue crawling JSON lines data in S3

I have this type of data in my S3:
{"version":"0","id":"c1d9e9a4-25a2-a0d8-2fa4-b062efec98c4","detail-type":"OneTypeee","source":"OneSource","account":"123456789","time":"2021-01-17T12:35:17Z","region":"eu-central-1","resources":[],"detail":{"Key1":"Value1"}}
{"version":"0","id":"c13879a4-2h32-a0d8-9m33-b03jsh3cxxj4","detail-type":"OtherType","source":"SomeMagicSource","account":"123456789","time":"2021-01-17T12:36:17Z","region":"eu-central-1","resources":[],"detail":{"Key2":"Value2", "Key22":"Value22"}}
{"version":"0","id":"gi442233-3y44a0d8-9m33-937rjd74jdddj","detail-type":"MoreTypes","source":"SomeMagicSource2","account":"123456789","time":"2021-01-17T12:45:17Z","region":"eu-central-1","resources":[],"detail":{"MagicKey":"MagicValue", "Foo":"Bar"}}
Please note, I have added new lines to make it more readable. In reality, Kinesis Firehose produces these batches with no newlines.
When I try to run an AWS Glue crawler on this type of data, it only crawls the first JSON record and that's it. I know this because when I run Athena SQL queries, I always get only one (the first) result.
How do I make a Glue crawler correctly crawl through this data and build a correct schema, so that I can query all of it?
I wasn't able to get a crawler to run through the JSON-lines data, but simply specifying in the Glue table's SerDe properties that the data is JSON worked for me. Glue automatically splits the JSON by newline and I can query the data in my Glue jobs.
Here's what my table's properties look like. Additionally, my JSON-lines data was compressed, so you can ignore the compressionType property here.
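The table-properties screenshot isn't reproduced here, but as a purely hypothetical sketch of the idea (all names, the S3 location, and the column list are placeholders, not the original poster's setup), defining such a table with boto3 and a JSON SerDe might look roughly like this:
import boto3

glue = boto3.client("glue", region_name="eu-central-1")  # region is a placeholder

glue.create_table(
    DatabaseName="my_database",  # placeholder
    TableInput={
        "Name": "my_events",     # placeholder
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "json"},  # tells Glue/Athena the data is JSON
        "StorageDescriptor": {
            "Location": "s3://my-bucket/events/",  # placeholder
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe",
            },
            # Column list trimmed for illustration.
            "Columns": [
                {"Name": "version", "Type": "string"},
                {"Name": "id", "Type": "string"},
                {"Name": "source", "Type": "string"},
                {"Name": "account", "Type": "string"},
            ],
        },
    },
)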
I had the same issue, and for me the reason was that the JSON records were being written to the S3 bucket without a trailing newline character (\n).
Make sure your JSON records are written with \n appended at the end. In the case of Java, something like this:
// Append a newline to each JSON record so that each record becomes
// its own line in the objects Firehose delivers to S3.
PutRecordRequest request = new PutRecordRequest()
        .withRecord(new Record().withData(ByteBuffer.wrap((json + "\n").getBytes())))
        .withDeliveryStreamName(streamName);
amazonKinesis.putRecordAsync(request);
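A rough boto3 equivalent of the same idea, for reference (the delivery stream name, region, and payload are placeholders):
import json
import boto3

firehose = boto3.client("firehose", region_name="eu-central-1")  # region is a placeholder

event = {"detail-type": "OneTypeee", "detail": {"Key1": "Value1"}}  # placeholder payload

# Append "\n" so every record becomes its own line in the S3 objects Firehose writes.
firehose.put_record(
    DeliveryStreamName="my-delivery-stream",  # placeholder
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)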

Upload multi-lined JSON log to AWS CloudWatch Log

put-log-events expects the log events in the JSON file to be wrapped in [ and ], e.g.:
# aws logs put-log-events --log-group-name my-logs --log-stream-name 20150601 --log-events file://events
[
{
"timestamp": long,
"message": "string"
}
...
]
However, my JSON file is in a multi-line format like:
{"timestamp": xxx, "message": "xxx"}
{"timestamp": yyy, "message": "yyy"}
Is it possible to upload without writing my own program?
[1] https://docs.aws.amazon.com/cli/latest/reference/logs/put-log-events.html#examples
An easy way to publish the batch without any coding would be to use jq to do the necessary transformation on the file. jq is a command-line utility for JSON processing.
cat events | jq -s '.' > events-formatted.json
aws logs put-log-events --log-group-name my-logs --log-stream-name 20150601 --log-events file://events-formatted.json
With this, the data is formatted so that it can be ingested into CloudWatch.
If you want to keep those lines as a single event, you can join them with \n and send them that way.
Since the lines look like self-sufficient JSON documents themselves, sending them as an array of events (hence the [...]) might not be that bad, since they will land in the same log group and will be easy to find as a batch; a boto3 sketch of this option follows.
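If you do end up scripting it, a small boto3 sketch of that array-of-events option (the log group and stream names are taken from the example above; it assumes each line already carries an epoch-millisecond timestamp and a message, as in the put-log-events format):
import json
import boto3

logs = boto3.client("logs", region_name="us-east-1")  # region is a placeholder

# Read the multi-line JSON file and turn each line into a log event.
with open("events") as f:
    events = [json.loads(line) for line in f if line.strip()]

# put-log-events expects events in chronological order.
events.sort(key=lambda e: e["timestamp"])

logs.put_log_events(
    logGroupName="my-logs",
    logStreamName="20150601",
    logEvents=events,
    # Older log streams may additionally require a sequenceToken from describe-log-streams.
)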
You will need to escape it as suggested and remove the newlines. Even though a lot of JSON is used as a consumer format these days, it isn't a great raw representation when it comes to logs, because logs can get truncated.
Try parsing truncated JSON; it's no fun at all!
You also don't want the timestamp embedded inside your log messages, as this will break the filtering and search ability that you get with CloudWatch.
You can stream a raw format to CloudWatch Logs and then use streams to parse that raw data, format it, filter it, or whatever you want to do, into a service such as Elasticsearch. I would recommend streaming to the Elasticsearch service on AWS if you want to do more with your logs than what CloudWatch gives you, and there you can keep your embedded timestamp format as well if you wish.

Upload log4j2 JSON formatted log files to AWS CloudWatch with correct timestamp

I'm trying to upload log files to AWS CloudWatch. The application is outputting log4j2 style JSON into a file:
https://logging.apache.org/log4j/2.x/manual/layouts.html#JSONLayout
AWS provides two CloudWatch Logs agents for this task: an 'older' agent and a 'unified' agent:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_GettingStarted.html
I have tried using both agents, but I run into the following problems, mainly related to parsing timestamps and the fact that the agent does a regex match on the entire log line rather than parsing it as JSON:
An example log message (additional fields omitted):
{"message": "Processing data for previous day: 2019-06-17T02:01:00", "timestamp": "2019-06-18T17:16:19.338000+0100"}
The older agent threw an exception because it was attempting to use my configured timestamp format to parse the timestamp in the message, not the one in the timestamp field.
The unified agent is unable to parse timestamps with sub-second precision, which causes problems when attempting to combine log streams from multiple sources.
So, is there a better tool or strategy for uploading JSON-formatted log4j2 files to CloudWatch?