Athena / Presto table on top of multiple json in one line - amazon-athena

I am looking for ways to read, with Athena, Lambda logs that get shipped to S3. We use centralized logging as explained here, with minor variations. The flow is: Lambda -> CloudWatch -> subscription filter -> Kinesis stream -> Kinesis Firehose -> S3.
https://aws.amazon.com/solutions/implementations/centralized-logging/
The logs land in a multiple-JSON-per-line format: it is always one line containing multiple JSON objects concatenated back to back, with no delimiter between them.
{"messageType":"DATA_MESSAGE","owner":"xxxx","logGroup":"/aws/lambda/xxx","logStream":"2021/07/16/[4]...... }]}{"messageType":"DATA_MESSAGE","owner":"xxxx","logGroup":"/aws/lambda/xxx","logStream":"2021/07/16/[4]......}]}
I tried to use org.openx.data.jsonserde.JsonSerDe to read this JSON, but, as you would imagine, it only parses out the first JSON object. The second object concatenated to the first is ignored entirely. There are no errors from the DDL; it just silently ignores the rest.
https://github.com/yuki-mt/scripts/blob/a73076cbb3062dae2cf30ddbbe68f5293738264d/aws/athena/sample_queries.sql
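For illustration, a minimal DDL of the kind I tried looks like this (the table name, S3 location, and trimmed column list are placeholders; only the fields visible in the sample line above are kept):
-- Table name and LOCATION are placeholders; the remaining payload fields are omitted.
CREATE EXTERNAL TABLE central_lambda_logs (
  `messageType` string,
  `owner` string,
  `logGroup` string,
  `logStream` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-central-logging-bucket/prefix/';
A SELECT against such a table returns values from the first JSON object on each line only; the second concatenated object never shows up.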
Has anyone encountered this requirement and solved it with a smart DDL? There is not a lot of room to re-format the logs or look at different solutions; I am restricted to S3 and Athena.
Thanks for your time.

Related

Is there a way to deal with changes in log schema?

I am in a situation where I need to extract log JSON data, which might have changes in its data structure, to AWS S3 in a real-time manner.
I am thinking of using AWS S3 + AWS Glue Streaming ETL. The thing is that the structure, or schema, of the log JSON data might change (and these changes are unpredictable), so my solution needs to be aware of such changes and still stream the log data smoothly without causing errors. But as far as I know, all the AWS Glue tutorials demo the case where the structure of the incoming data never changes.
Can you recommend a solution within AWS that is suitable for my case?
Thanks.

AWS Glue crawling JSON lines data in S3

I have this type of data in my S3:
{"version":"0","id":"c1d9e9a4-25a2-a0d8-2fa4-b062efec98c4","detail-type":"OneTypeee","source":"OneSource","account":"123456789","time":"2021-01-17T12:35:17Z","region":"eu-central-1","resources":[],"detail":{"Key1":"Value1"}}
{"version":"0","id":"c13879a4-2h32-a0d8-9m33-b03jsh3cxxj4","detail-type":"OtherType","source":"SomeMagicSource","account":"123456789","time":"2021-01-17T12:36:17Z","region":"eu-central-1","resources":[],"detail":{"Key2":"Value2", "Key22":"Value22"}}
{"version":"0","id":"gi442233-3y44a0d8-9m33-937rjd74jdddj","detail-type":"MoreTypes","source":"SomeMagicSource2","account":"123456789","time":"2021-01-17T12:45:17Z","region":"eu-central-1","resources":[],"detail":{"MagicKey":"MagicValue", "Foo":"Bar"}}
Please note, I have added new lines to make it more readable. In reality, Kinesis Firehose produces these batches with no newlines.
When I try to run an AWS Glue crawler on this type of data, it only crawls the first JSON line and that's it. I know this because when I run Athena SQL queries, I always get only one (first) result.
How do I make a Glue crawler correctly crawl through this data and build a correct schema so I can query all of that data?
I wasn't able to run a crawler through JSON lines data, but simply specifying in the Glue Table Serde properties that the data is JSON worked for me. Glue automatically splits the JSON by newline and I can query the data in my Glue Jobs.
Here's what my table's properties look like. Additionally, my JSON lines data was compressed, so you can ignore the compressionType property here.
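Expressed as a DDL sketch rather than the console screenshot, the equivalent table looks roughly like this (the table name, S3 location, and column list are assumptions based on the sample records above; the essential pieces are the JSON SerDe and the classification property):
-- Names and LOCATION are placeholders; compressionType matters only because my data was gzipped.
CREATE EXTERNAL TABLE events (
  `version` string,
  `id` string,
  `detail-type` string,
  `source` string,
  `account` string,
  `time` string,
  `region` string,
  `resources` array<string>,
  `detail` map<string,string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/events/'
TBLPROPERTIES ('classification' = 'json', 'compressionType' = 'gzip');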
I had the same issue, and for me the reason was that JSON records were being written to the S3 bucket without a trailing newline character: \n.
Make sure your JSON records are written with \n appended at the end. In the case of Java, something like this:
import java.nio.ByteBuffer;
import com.amazonaws.services.kinesisfirehose.model.PutRecordRequest;
import com.amazonaws.services.kinesisfirehose.model.Record;
// Appending "\n" makes Firehose deliver one JSON record per line in S3.
PutRecordRequest request = new PutRecordRequest()
        .withRecord(new Record().withData(ByteBuffer.wrap((json + "\n").getBytes())))
        .withDeliveryStreamName(streamName);
amazonKinesis.putRecordAsync(request);

ELK stack (Elasticsearch, Logstash, Kibana) - is Logstash a necessary component?

We're currently processing daily mobile app log data with AWS Lambda and posting it into Redshift. The Lambda structures the data, but it is essentially raw. The next step is to do some actual processing of the log data into sessions etc., for reporting purposes. The final step is to have something do feature engineering, and then use the data for model training.
The steps are:
1) Structure the raw data for storage
2) Sessionize the data for reporting
3) Feature engineering for modeling
For step 2, I am looking at using QuickSight and/or Kibana to create a reporting dashboard. But the typical stack, as I understand it, is to do the log processing with Logstash, then have it go to Elasticsearch and finally to Kibana/QuickSight. Since we're already handling the initial log processing through Lambda, is it possible to skip this step and pass it directly into Elasticsearch? If so, where does this happen - in the Lambda function or from Redshift after it has been stored in a table? Or can Elasticsearch just read it from the same S3 bucket where I'm posting the data for ingestion into a Redshift table?
Elasticsearch uses JSON to perform all operations. For example, to add a document to an index, you use a PUT operation (copied from docs):
PUT twitter/_doc/1
{
"user" : "kimchy",
"post_date" : "2009-11-15T14:12:12",
"message" : "trying out Elasticsearch"
}
Logstash exists to collect log messages, transform them into JSON, and make these PUT requests. However, anything that produces correctly-formatted JSON and can perform an HTTP PUT will work. If you already invoke Lambdas to transform your S3 content, then you should be able to adapt them to write JSON to Elasticsearch. I'd use separate Lambdas for Redshift and Elasticsearch, simply to improve manageability.
Performance tip: you're probably processing lots of records at a time, in which case the bulk API will be more efficient than individual PUTs. However, there is a limit on the size of a request, so you'll need to batch your input.
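For reference, a bulk request packs many documents into one newline-delimited body, one action line followed by one document line per record (the index name and documents here are placeholders):
POST _bulk
{ "index" : { "_index" : "app-logs" } }
{ "user" : "kimchy", "post_date" : "2009-11-15T14:12:12", "message" : "trying out Elasticsearch" }
{ "index" : { "_index" : "app-logs" } }
{ "user" : "kimchy", "post_date" : "2009-11-15T14:13:12", "message" : "second document in the same request" }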
Also: you don't say whether you're using an AWS Elasticsearch cluster or self-managed. If the former, you'll also have to deal with authenticated requests, or use an IP-based access policy on the cluster. You don't say what language your Lambdas are written in, but if it's Python you can use the aws-requests-auth library to make authenticated requests.

How do I import JSON data from S3 using AWS Glue?

I have a whole bunch of data in AWS S3 stored in JSON format. It looks like this:
s3://my-bucket/store-1/20190101/sales.json
s3://my-bucket/store-1/20190102/sales.json
s3://my-bucket/store-1/20190103/sales.json
s3://my-bucket/store-1/20190104/sales.json
...
s3://my-bucket/store-2/20190101/sales.json
s3://my-bucket/store-2/20190102/sales.json
s3://my-bucket/store-2/20190103/sales.json
s3://my-bucket/store-2/20190104/sales.json
...
It's all the same schema. I want to get all that JSON data into a single database table. I can't find a good tutorial that explains how to set this up.
Ideally, I would also be able to perform small "normalization" transformations on some columns, too.
I assume Glue is the right choice, but I am open to other options!
If you need to process data using Glue and there is no need to have a table registered in the Glue Catalog, then there is no need to run a Glue crawler. You can set up a job and use getSourceWithFormat() with the recurse option set to true and paths pointing to the root folder (in your case it's ["s3://my-bucket/"] or ["s3://my-bucket/store-1", "s3://my-bucket/store-2", ...]). In the job you can also apply any required transformations and then write the result into another S3 bucket, a relational DB, or the Glue Catalog.
Yes, Glue is a great tool for this!
Use a crawler to create a table in the Glue Data Catalog (remember to set "Create a single schema for each S3 path" under "Grouping behavior for S3 data" when creating the crawler).
Read more about it here
Then you can use relationalize to flatten out your JSON structure; read more about that here.
JSON and AWS Glue may not be the best match. Since AWS Glue is based on Hadoop, it inherits Hadoop's "one-row-per-newline" restriction, so even if your data is in JSON, it has to be formatted with one JSON object per line [1]. Since you'll be pre-processing your data anyway to get it into this line-separated format, it may be easier to use CSV instead of JSON.
Edit 2022-11-29: There does appear to be some tooling now for JSONL, which is the actual format that AWS expects, making this less of an automatic win for CSV. I would say if your data is already in JSON format, it's probably smarter to convert it to JSONL than to convert to CSV.

AWS S3 storage and schema

I have an IoT sensor which sends the following message to an AWS IoT Core MQTT topic:
{"ID1":10001,"ID2":1001,"ID3":101,"ValueMax":123}
I have added an action/rule which stores the incoming message in an S3 bucket with the timestamp as the key (each message is stored as a separate file/row in the bucket).
I have only worked with SQL databases before, so having them stored like this is new to me.
1) Is this the proper way to work with S3 storage?
2) How can I visualize the values in a schema instead of separate files?
3) I am trying to create an ML Datasource from the S3 bucket, but get the error below when Amazon ML tries to create the schema:
"Amazon ML can't retrieve the schema. If you've just created this
datasource, wait a moment and try again."
Appreciate all advice there is!
1) Is this the proper way to work with S3 storage?
With only one sensor, using the timestamp() function (https://docs.aws.amazon.com/iot/latest/developerguide/iot-sql-functions.html#iot-function-timestamp) in your IoT rule would be a way to name unique objects in S3, but there are issues that might come up.
With more than one sensor, you might have multiple messages arrive at the same timestamp and this would not generate unique object names in S3.
Timestamps from nearly the same time are going to have similar prefixes and designing your S3 keys this way may not give you the best performance at higher message rates.
Since you're using MQTT, you could use the traceId function instead of the timestamp to avoid these two issues if they come up.
2) How can I visualize the values in a schema instead of separate files?
3) I am trying to create an ML Datasource from the S3 bucket, but get the error below when Amazon ML tries to create the schema:
For the third question, I think you could be running into a data format problem in ML because your S3 objects contain the JSON data from your messages and not a CSV.
For the second question, I think you're trying to combine message data from successive messages into a CSV, or at least output the message data as a single line of a CSV file. I don't think this is possible with just the IoT SQL language, since it's intended to produce JSON.
One alternative is to configure your IoT rule with a Lambda action and use a Lambda function to do your JSON-to-CSV conversion and then write the CSV to your S3 bucket. If you go this direction, you may have to enrich your IoT message data with the timestamp (or traceid) as you call the Lambda.
A rule like select timestamp() as timestamp, traceid() as traceid, concat(ID1, ID2, ID3, ValueMax) as values, * as message would produce a JSON like
{"timestamp":1538606018066,"traceid":"abab6381-c369-4a08-931d-c08267d12947","values":[10001,1001,101,123],"message":{"ID1":10001,"ID2":1001,"ID3":101,"ValueMax":123}}
That would be straightforward to use as the source for a CSV row with the data from its values property.
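For instance, the example above maps to a CSV line like this (shown here with the timestamp and traceid prepended, as suggested earlier; the column order is up to you):
1538606018066,abab6381-c369-4a08-931d-c08267d12947,10001,1001,101,123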