I am in a situation where I need to Extract the log JSON data, which might have changes in its data structure, to AWS S3 in real time manner.
I am thinking of using AWS S3 + AWS Glue Streaming ETL. The thing is the structure or schema of the log JSON data might change(these changes are unpredictable), so my solution needs to be aware of such changes and should still stream the log data smoothly without causing errors... But as far as I know, all the AWS Glue tutorials are showing the demo as if there is no changes in the structure of the incoming data.
Can you recommend or tell me the solution within AWS that's suitable for my case?
Thanks.
Related
I have a task to analyze weather forecast data in Quicksight. The forecast data is held in NetCDF binary files in a public S3 bucket. The question is: how do you expose the contents of these binary files to Quicksight or even Athena?
There are python libraries that will decode the data from the binary files, such as Iris. They are used like this:
import iris
filename = iris.sample_data_path('forecast_20200304.nc')
cubes = iris.load(filename)
print(cubes)
So what would be the AWS workflow and services necessary to create a data ingestion pipeline that would:
Respond to an SQS message that a new binary file is available
Access the new binary file and decode it to access the forecast data
Add the decoded data to the set of already decoded data from previous SQS notifications
Make all the decoded data available in Athena / Quicksight
Tricky one, this...
What I would do is probably something like this:
Write a Lambda function in Python that is triggered when new files appear in the S3 bucket – either by S3 notifications (if you control the bucket), by SNS, SQS, or by schedule in EventBridge. The function uses the code snipplet included in your question to transform each new file and upload the transformed data to another S3 bucket.
I don't know the size of these files and how often they are published, so whether to convert to CSV, JSON, or Parquet is something you have to decide – if the data is small CSV will probably be easiest and will be good enough.
With the converted data in a new S3 bucket all you need to do is create an Athena table for the data set and start using QuickSight.
If you end up with a lot of small files you might want to implement a second step where you once per day combine the converted files into bigger files, and possibly Parquet, but don't do anything like that unless you have to.
An alternative way would be to use Athena Federated Query: by implementing Lambda function(s) that respond to specific calls from Athena you can make Athena read any data source that you want. It's currently in preview, and as far as I know all the example code is written in Java – but theoretically it would be possible to write the Lambda functions in Python.
I'm not sure whether it would be less work than implementing an ETL workflow like the one you suggest, but yours is one of the use cases for which Athena Federated Query was designed for and it might be worth looking into. If NetCDF files are common and a data source for such files would be useful for other people I'm sure the Athena team would love to talk to you and help you out.
I would like to ask you about getting some advice about handling many application events on AWS. My application sends a lot of different events about everything what a user did in real-time. For collecting those events, I’m using AWS firehose (kinesis) - I have few data streams where I push some different events. Some events, before storing on S3/Redshift contains data which I want to extract and store to other databases (DynamoDB) or to other S3 files — for that case I’m using lambda which is assigned to a specific stream.
My problem is that business adds more and more new events which they need to collect or do something with data and for every new event or „group” events I need create separate data stream + s3/rs/es + lambda for extracting data. Also, events on S3 are stored in one format and there is not possible to group that events e.g. by userId from an application or even name of the event in the stream filename. Ideal s3 with that events would look like events/{user_id}/{date}/{event-name}{timestamp}.json.
Maybe I’m wrong using firehose or I have wrong thinking about firehose in my case, maybe there are other, better services on AWS for my case which can give me more control. Maybe simple SQS + lambdas as a listener on S3 is better solution in this case?
Thanks for any advice.
EDIT 12th Nov 2020
This was supposed to be a comment for #Lina, but it was too long to put a comment, so I updated my question with the solution which I pick.
I resolved my issue as I "felt", so it may not be a good way to repeat, but: I've written a nodejs routing application which I connected on firehose and I wrote a few microservices where data is sent from firehose by my routing app. So now, I have a firehose tube and I'm taking 10 different event types. When some event came, my routing application decides what microservice should be run with what data based on the event type (the raw firehose event is still stored on s3 automatically). This gives me needed flexibility as I can extract specific data from the event, do with that data what I need, by running every other microservices from the whole system and still have a raw event in the s3 in case of needed revert history of events.
Some of the events are not passing to any service, it is just stored as a raw s3 file e.g. application logs - I can do many things with that files on S3 PUT/CREATE event.
I hope that it will help someone with a similar problem.
We're currently processing daily mobile app log data with AWS lambda and posting it into redshift. The lambda structures the data but it is essentially raw. The next step is to do some actual processing of the log data into sessions etc, for reporting purposes. The final step is to have something do feature engineering, and then use the data for model training.
The steps are
Structure the raw data for storage
Sessionize the data for reporting
Feature engineering for modeling
For step 2, I am looking at using Quicksight and/or Kibana to create reporting dashboard. But the typical stack as I understand it is to do the log processing with logstash, then have it go to elasticsreach and finally to Kibana/Quicksight. Since we're already handling the initial log processing through lambda, is it possible to skip this step and pass it directly into elasticsearch? If so where does this happen - in the lambda function or from redshift after it has been stored in a table? Or can elasticsearch just read it from the same s3 where I'm posting the data for ingestion into a redshift table?
Elasticsearch uses JSON to perform all operations. For example, to add a document to an index, you use a PUT operation (copied from docs):
PUT twitter/_doc/1
{
"user" : "kimchy",
"post_date" : "2009-11-15T14:12:12",
"message" : "trying out Elasticsearch"
}
Logstash exists to collect log messages, transform them into JSON, and make these PUT requests. However, anything that produces correctly-formatted JSON and can perform an HTTP PUT will work. If you already invoke Lambdas to transform your S3 content, then you should be able to adapt them to write JSON to Elasticsearch. I'd use separate Lambdas for Redshift and Elasticsearch, simply to improve manageability.
Performance tip: you're probably processing lots of records at a time, in which case the bulk API will be more efficient than individual PUTs. However, there is a limit on the size of a request, so you'll need to batch your input.
Also: you don't say whether you're using an AWS Elasticsearch cluster or self-managed. If the former you'll also have to deal with authenticated requests, or use an IP-based access policy on the cluster. You don't say what language your Lambdas are written in, but if it's Python you can use the aws-requests-auth library to make authenticated requests.
I have an IOT sensor which sends the following message to IoT MQTT Core topic:
{"ID1":10001,"ID2":1001,"ID3":101,"ValueMax":123}
I have added ACT/RULE which stores the incoming message in an S3 Bucket with the timestamp as a key(each message is stored as a seperate file/row in the bucket).
I have only worked with SQL databases before, so having them stored like this is new to me.
1) Is this the proper way to work with S3 storage?
2) How can I visualize the values in a schema instead of separate files?
3) I am trying to create ML Datasource from the S3 Bucket, but get the error below when Amazon ML tries to create schema:
"Amazon ML can't retrieve the schema. If you've just created this
datasource, wait a moment and try again."
Appreciate all advice there is!
1) Is this the proper way to work with S3 storage?
With only one sensor, using the [timestamp](https://docs.aws.amazon.com/iot/latest/developerguide/iot-sql-functions.html#iot-function-timestamp function in your IoT rule would be a way to name unique objects in S3, but there are issues that might come up.
With more than one sensor, you might have multiple messages arrive at the same timestamp and this would not generate unique object names in S3.
Timestamps from nearly the same time are going to have similar prefixes and designing your S3 keys this way may not give you the best performance at higher message rates.
Since you're using MQTT, you could use the traceId function instead of the timestamp to avoid these two issues if they come up.
2) How can I visualize the values in a schema instead of separate files?
3) I am trying to create ML Datasource from the S3 Bucket, but get the error below when Amazon ML tries to create schema:
For the third question, I think you could be running into a data format problem in ML because your S3 objects contain the JSON data from your messages and not a CSV.
For the second question, I think you're trying to combine message data from successive messages into a CSV, or at least output the message data as a single line of a CSV file. I don't think this is possible with just the Iot SQL language since it's intended to produce JSON.
One alternative is to configure your IoT SQL rule with a Lambda action and use a lambda function to make your JSON to CSV conversion and then write the CSV to your S3 bucket. If you go this direction, you may have to enrich your IoT message data with the timestamp (or traceId) as you call the lambda.
A rule like select timestamp() as timestamp, traceid() as traceid, concat(ID1, ID2, ID3, ValueMax) as values, * as message would produce a JSON like
{"timestamp":1538606018066,"traceid":"abab6381-c369-4a08-931d-c08267d12947","values":[10001,1001,101,123],"message":{"ID1":10001,"ID2":1001,"ID3":101,"ValueMax":123}}
That would be straightforward to use as the source for a CSV row with the data from its values property.
Considering the fact that there is no data pipeline available in Singapore region, are there any alternatives available to efficiently push csv data to dynamodb?
If it was me, I would setup an s3 event notification on a bucket that fires a lambda function each time a CSV file was dropped into it.
The Notification would let Lambda know that a new file was available and a lambda function would be responsible for loading the data into dynamodb.
This would work better (because of the limits of lambda) if the CSV files were not huge, so they could be processed in a reasonable amount of time, and the bonus is the only worked that would need to be done once it was working would be to simply drop the new files into the right bucket - no server required.
Here is a github repository that has a CSV->Dynamodb loader written in java - it might help get you started.